Contents

1 Install required packages

Install bigWig and latticeExtra package

install.packages("devtools", quiet = TRUE)
## also installing the dependencies 'Rcpp', 'utf8', 'askpass', 'credentials', 'openssl', 'sys', 'zip', 'gitcreds', 'httr2', 'ini', 'httpuv', 'xtable', 'sourcetools', 'later', 'promises', 'fansi', 'systemfonts', 'textshaping', 'pillar', 'pkgconfig', 'diffobj', 'rematch2', 'clipr', 'crayon', 'curl', 'gert', 'gh', 'purrr', 'rprojroot', 'rstudioapi', 'whisker', 'shiny', 'callr', 'processx', 'downlit', 'httr', 'ragg', 'tibble', 'xml2', 'htmlwidgets', 'prettyunits', 'xopen', 'brew', 'commonmark', 'cpp11', 'brio', 'praise', 'ps', 'waldo', 'usethis', 'desc', 'miniUI', 'pkgbuild', 'pkgdown', 'pkgload', 'profvis', 'rcmdcheck', 'remotes', 'roxygen2', 'rversions', 'sessioninfo', 'testthat', 'urlchecker', 'withr'
library(devtools)
## Loading required package: usethis
devtools::install_github('andrelmartins/bigWig',
              subdir='bigWig')
## Downloading GitHub repo andrelmartins/bigWig@HEAD
## ── R CMD build ─────────────────────────────────────────────────────────────────
## * checking for file ‘/private/tmp/RtmpCfYOb9/remotes1517f3befa4/andrelmartins-bigWig-beac5a7/bigWig/DESCRIPTION’ ... OK
## * preparing ‘bigWig’:
## * checking DESCRIPTION meta-information ... OK
## * cleaning src
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * building ‘bigWig_0.2-9.tar.gz’
library(bigWig)

install.packages("latticeExtra", quiet = TRUE)
## also installing the dependencies 'deldir', 'RcppEigen', 'png', 'jpeg', 'RColorBrewer', 'interp'
install.packages("DESeq2", quiet = TRUE)
## Warning: package 'DESeq2' is not available for this version of R
## 
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

Install bedtools

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install bedtools

2 Jan 2nd

2.1 Summary of Exhaustive MEME/STREM/MAST analysis:

We begin by using a 101bp window centered around the peak summit and employ MEME/STREME software to identify potential GATA3 binding sites. MEME (classic) identified six GATA3 motifs within the GATA3 ChIP peak set. Subsequent MEME analysis revealed no additional GATA3-matched motifs in peaks without the initial 6 motifs discovered by MEME.
On the other hand, STREME detected two GATA3 motifs in peaks where the 6 MEME-found motifs were absent.

We then expanded the search window from 101bp to 161bp and used MEME and STREME to identify potential binding sites within the full 161bp window and the spanned region only (with the central 81bp swapped to N).

With STREME, we discovered two motifs within the spanned region. These motifs resembled the previously found motifs within the 101bp window. As a result, we concluded that the 101bp window was insufficient for capturing all potential GATA3 binding sites.

Subsequently, we employed MAST to locate peaks containing the previous 8 motifs within the 161bp window. For peaks that didn’t overlap with any motif coordinates, we increased their width from 161bp to 181bp and conducted the same MEME/STREME analysis. However, this time, no GATA3-like motifs were found.

In my GATA3 peak set, GATA3 appears to preferentially bind to the central 161bp window. Despite this, MEME/STREME failed to identify GATA3 binding elements in 38.5% of the peaks.

2.2 Motif Occurence in GATA peaks

stack bar plot
a bar plot showing peaks contain GATA-like motifs, and other peaks

wc -l  *round*.bed
# 11427 mast_GATA3_PSWM_in_peaks_round1.bed
#  10460 mast_GATA3_PSWM_in_peaks_round2.bed
#   5202 mast_GATA3_PSWM_in_peaks_round3.bed
#   5665 mast_GATA3_PSWM_in_peaks_round4.bed
#   3564 mast_GATA3_PSWM_in_peaks_round5.bed
#   4510 mast_GATA3_PSWM_in_peaks_round6.bed
#   8733 mast_GATA3_PSWM_in_peaks_round7.bed
#   4452 mast_GATA3_PSWM_in_peaks_round8.bed
#   6002 mast_GATA3_PSWM_in_peaks_round9.bed --161bp window
#  60015 total

   
counts=$(wc -l without_motifs_123456_78_161bp_mast.bed | awk -F"without" '{print $1}')  #37308
label1='others'
echo $counts "" $label1 >> bar.csv
total_counts=$(wc -l /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_call/GATA_ChIP_summit_100window.bed | awk -F"/" '{print $1}')  #96868
counts2=$(echo "$total_counts - $counts" | bc)
label2='GATA3-like'
echo $counts2 "" $label2 >> bar.csv
#module load R/4.1.2 
#R 

library("lattice") 
bar=read.csv('bar.csv', sep = "", header=F)
colnames(bar)=c("peak_numbers", "enriched_motifs")
bar$dum.x="peak number"
bar$enriched_motif<- factor(bar$enriched_motif, levels = c("GATA3-like","others"))

pdf('240103_peak_number_with_or_without_GATA3_motif_variant.pdf', width=5,height=6)
my.settings <- list(
  #superpose.polygon=list(col=c(colorRampPalette(c("red","pink"))(6),colorRampPalette(c("blue","light blue"))(3), "light grey"), border="transparent"),
  superpose.polygon=list(col=c("red","light grey"), border="transparent"),
  strip.background=list(col="grey80", cex = 0.6),
  strip.border=list(col="black")
)
print(barchart(peak_numbers ~ dum.x,         
         data = bar,
         groups = enriched_motif,
         stack = TRUE,
         auto.key=list(space="right"),
         #scales = list(x = list(rot = 45)),
         ylab = "peak set within intensity quantile",
         xlab = "number of peaks with/without GATA3 motif variant",
         par.settings = my.settings)
)
dev.off()

png('240103_peak_number_with_or_without_GATA3_motif_variant.png')
my.settings <- list(
  #superpose.polygon=list(col=c(colorRampPalette(c("red","pink"))(6),colorRampPalette(c("blue","light blue"))(3), "light grey"), border="transparent"),
  superpose.polygon=list(col=c("red","light grey"), border="transparent"),
  strip.background=list(col="grey80", cex = 0.6),
  strip.border=list(col="black")
)
print(barchart(peak_numbers ~ dum.x,         
         data = bar,
         groups = enriched_motif,
         stack = TRUE,
         auto.key=list(space="right"),
         #scales = list(x = list(rot = 45)),
         ylab = "peak set within intensity quantile",
         xlab = "number of peaks with/without GATA3 motif variant",
         par.settings = my.settings)
)
dev.off()
percent peaks with or without GATA3 motif variants

Figure 1: percent peaks with or without GATA3 motif variants

bar plot (group by peak intensity)
The fraction of peaks with MEME-found-motif 123456, as well as STREME-found motif78, in each quantile:

# quantile file is 1bp summit file, expanded to 161window
#module load R/4.1.2 
#R
library(bigWig)

for (chip.peak in Sys.glob(file.path("/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_Intensity/GATA_Deseq2/quantile*summits.bed")))  {
    print(chip.peak)
    quantile.name = strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], '_summits.bed')[[1]][1]
    print(quantile.name)
    chip.peaks=read.table(chip.peak, header=FALSE)
    chip.peak.161win=center.bed(chip.peaks, upstreamWindow=80, downstreamWindow=80)
    write.table(chip.peak.161win,file= paste0(quantile.name, '_summits_161window.bed'), quote=F,sep="\t",col.names=F,row.names=F)
    }
module load bedtools
dir="/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_Intensity/GATA_Deseq2/"

# quantile peaks with individual motif
for i in ${dir}*quantile*161window.bed
do
  nm=$(echo $i | awk -F $dir '{print $2}' | awk -F "_summits_161window.bed" '{print $1}')
  intersectBed -wa -a $i -b ../without_motifs_123456_78_161bp_mast.bed | sort -k1,1 -k2,2n | uniq >${nm}_without_motifs_123456_78_161bp_mast.bed
done


touch quantile.240103.sum.161window.txt
for i in *quantile*_without_motifs*.bed
do
 nm=$(echo $i | awk -F"_without_motifs_123456_78_161bp_mast.bed" '{print $1}')
 p=$(wc -l $i | awk '{OFS="\t";} {print $1}')
 totalp=$(wc -l ${dir}${nm}_summits.bed | awk '{OFS="\t";} {print $1}')
 withoutMotif=$(bc <<< "scale=4 ; $p / $totalp")
 withMotif=$(bc <<< "scale=4 ; ($totalp- $p) / $totalp")
 echo $nm "" $withoutMotif "" $withMotif>> quantile.240103.sum.161window.txt
done
#module load R/4.1.2 
#R 

library("lattice") 
library("reshape2")


df_sum=read.table('quantile.240103.sum.161window.txt', sep = "", header=FALSE)
colnames(df_sum)=c("quantile","without_Motif","with_Motif")
df_sum_long <- df_sum

df_sum_long <- melt(df_sum_long, id = "quantile")
df_sum_long$variable<- factor(df_sum_long$variable, levels = c("with_Motif", "without_Motif"))

pdf('240103_percent_peaks_withorwithout_GATA3_motif_variant.pdf',width=8,height=5)
my.settings <- list(
  superpose.polygon=list(col=c("black", "grey"), border="transparent"),
  strip.background=list(col="grey80", cex = 0.6),
  strip.border=list(col="black")
)
print(barchart(value ~ quantile,         
         data = df_sum_long,
         groups = variable,
         stack = TRUE,
         auto.key=list(space="right"),
         scales = list(x = list(rot = 45)),
         ylab = "peak set within intensity quantile",
         xlab = "fraction of peaks with/without GATA3 motif variant",
         par.settings = my.settings)
)
dev.off()

png('240103_percent_peaks_withorwithout_GATA3_motif_variant.png')
my.settings <- list(
  superpose.polygon=list(col=c("black", "grey"), border="transparent"),
  strip.background=list(col="grey80", cex = 0.6),
  strip.border=list(col="black")
)
print(barchart(value ~ quantile,         
         data = df_sum_long,
         groups = variable,
         stack = TRUE,
         auto.key=list(space="right"),
         scales = list(x = list(rot = 45)),
         ylab = "peak set within intensity quantile",
         xlab = "fraction of peaks with/without GATA3 motif variant",
         par.settings = my.settings)
)
dev.off()
percent peaks with or without GATA3 motif (variant 1,2,3,4,5,6,7,8,9):
percent peaks with or without GATA3 motif variants

Figure 2: percent peaks with or without GATA3 motif variants

For peaks in the top quantile intensity (top 5% highest intensity), we have ~16% peaks MEME/STREME are not able to find GATA3 binding elements.

2.3 ENCODE DHS in MCF7 cells

2.3.1 Retrive the ENCODE DNAse-seq data for MCF7 cells (Refer to GSE29692)

Stam_MCF-7_1: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM736581
Stam_MCF-7_2: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM736588

IMPORTANT
1) Notice that these data are using hg19 genome, we need to use UCSC liftover to convert data to hg38.
2) While processing the negative control (DHS regions), we want to remove the 8 motifs first, use same window (161bp) and same mast p value.

# 173277 MCF7-DS12619.peaks.fdr0.01.hg19.bed
head -5 MCF7-DS12619.peaks.fdr0.01.hg19.bed
#chr1   10180   10330   .   0   .   12  5.68855 -1  -1
#chr1   16160   16310   .   0   .   11  4.49825 -1  -1
#chr1   237660  237810  .   0   .   41  33.423  -1  -1
#chr1   521440  521590  .   0   .   66  76.4085 -1  -1
#chr1   565560  565710  .   0   .   91  20.8239 -1  -1

#126717 MCF7-DS9445.peaks.fdr0.01.hg19.bed
head -5 MCF7-DS9445.peaks.fdr0.01.hg19.bed
#chr1   237720  237870  .   0   .   11  7.10762 -1  -1
#chr1   521440  521590  .   0   .   16  15.6231 -1  -1
#chr1   565280  565430  .   0   .   41  11.9262 -1  -1
#chr1   565540  565690  .   0   .   54  39.9645 -1  -1
#chr1   565860  566010  .   0   .   95  39.9645 -1  -1

Process the file, keep only the coordinates info. (liftover has format requirements).

awk '{OFS="\t"} {print $1, $2, $3}' MCF7-DS12619.peaks.fdr0.01.hg19.bed > MCF7_hg19_Encode_DHS_Rep2.bed
awk '{OFS="\t"} {print $1, $2, $3}' MCF7-DS9445.peaks.fdr0.01.hg19.bed > MCF7_hg19_Encode_DHS_Rep1.bed
head -5 MCF7_hg19_Encode_DHS_Rep2.bed #173277
#chr1   10180   10330
#chr1   16160   16310
#chr1   237660  237810
#chr1   521440  521590
#chr1   565560  565710

head -5 MCF7_hg19_Encode_DHS_Rep1.bed #126717
#chr1   237720  237870
#chr1   521440  521590
#chr1   565280  565430
#chr1   565540  565690
#chr1   565860  566010

Then we go to the UCSC liftover online tool to convert the hg19 regions to hg38 regions.
https://genome.ucsc.edu/cgi-bin/hgLiftOver

head -5 MCF7_hg38_Encode_DHS_Rep2.bed # 173226
#chr1   10180   10330   chr1:10181-10330    1
#chr1   16160   16310   chr1:16161-16310    1
#chr1   267909  268059  chr1:237661-237810  1
#chr1   586060  586210  chr1:521441-521590  1
#chr1   630180  630330  chr1:565561-565710  1

head -5 MCF7_hg38_Encode_DHS_Rep1.bed # 126690
#chr1   267969  268119  chr1:237721-237870  1
#chr1   586060  586210  chr1:521441-521590  1
#chr1   629900  630050  chr1:565281-565430  1
#chr1   630160  630310  chr1:565541-565690  1
#chr1   630480  630630  chr1:565861-566010  1

Notice that, during liftover process, there are chance to lose some regions.

Remove the original hg19 coordinates info, and keep only the hg38 coordinates.

awk '{OFS="\t"} {print $1, $2, $3}' MCF7_hg38_Encode_DHS_Rep2.bed > MCF7_hg38_Encode_DHS_Rep2_peak.bed
awk '{OFS="\t"} {print $1, $2, $3}' MCF7_hg38_Encode_DHS_Rep1.bed > MCF7_hg38_Encode_DHS_Rep1_peak.bed

wc -l MCF7_hg38_Encode_DHS_Rep2.bed #173226
wc -l MCF7_hg38_Encode_DHS_Rep2_peak.bed #173226
head -5 MCF7_hg38_Encode_DHS_Rep2_peak.bed
#chr1   10180   10330
#chr1   16160   16310
#chr1   267909  268059
#chr1   586060  586210
#chr1   630180  630330

wc -l MCF7_hg38_Encode_DHS_Rep1.bed #126690
wc -l MCF7_hg38_Encode_DHS_Rep1_peak.bed #126690
head -5 MCF7_hg38_Encode_DHS_Rep1_peak.bed
#chr1   267969  268119
#chr1   586060  586210
#chr1   629900  630050
#chr1   630160  630310
#chr1   630480  630630

2.3.2 remove regions overlap with GATA3 binding sites

make sure the DHS regions are 161bp window

library(bigWig)
peak.region.summit1=center.bed(read.table('MCF7_hg38_Encode_DHS_Rep1_peak.bed', sep = "\t", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
peak.region.summit2=center.bed(read.table('MCF7_hg38_Encode_DHS_Rep2_peak.bed', sep = "\t", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)

peak.region.161bp.1=center.bed(peak.region.summit1, upstreamWindow = 80, downstreamWindow = 80)
peak.region.161bp.2=center.bed(peak.region.summit2, upstreamWindow = 80, downstreamWindow = 80)
head(peak.region.161bp.1)
##     V1     V2     V3
## 1 chr1 267964 268125
## 2 chr1 586055 586216
## 3 chr1 629895 630056
## 4 chr1 630155 630316
## 5 chr1 630475 630636
## 6 chr1 631375 631536
head(peak.region.161bp.2)
##     V1     V2     V3
## 1 chr1  10175  10336
## 2 chr1  16155  16316
## 3 chr1 267904 268065
## 4 chr1 586055 586216
## 5 chr1 630175 630336
## 6 chr1 630475 630636
write.table(peak.region.161bp.1,file= 'MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(peak.region.161bp.2,file= 'MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed', quote=F,sep="\t",col.names=F,row.names=F)

remove GATA3 motifs with MAST

The DHS regions contain all regulatory regions that are sensitive to DNAse. To make a independent negative control, we use MAST to remove regulatory regions that overlap with GATA3-like motif 12345678 (use same p-value stringency).

First convert regulatory region file from .bed to .fasta.

module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa

fastaFromBed -fi $genome -bed MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed -fo MCF7_hg38_Encode_DHS_Rep1_161bp_peak.fasta
fastaFromBed -fi $genome -bed MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed -fo MCF7_hg38_Encode_DHS_Rep2_161bp_peak.fasta
head -3 MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed
head -6 MCF7_hg38_Encode_DHS_Rep1_161bp_peak.fasta
head -3 MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed
head -6 MCF7_hg38_Encode_DHS_Rep2_161bp_peak.fasta
## chr1 267964  268125
## chr1 586055  586216
## chr1 629895  630056
## >chr1:267964-268125
## GGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGCTACGTCCTCTCTTGAGCTACAGCAGATTCACTCTGTTCTGTTTCATTGTTGTTTAGTTTG
## >chr1:586055-586216
## TTTCCCACATTATTCAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCGGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGGGTGGGGGC
## >chr1:629895-630056
## TAACCAATACCACCAATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAATAGCCCCCTTTCACTTCTGAGTCCCAGAGGTTACCCAAGGCACCCCTCTGACATCCGGCCTGCTCCTTCTCACATGACAAAAACTAGCCCCCATC
## chr1 10175   10336
## chr1 16155   16316
## chr1 267904  268065
## >chr1:10175-10336
## aacctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccctaaccctaaccctaaaccctaaaccctaaccctaaccctaaccctaaccctaaccccaaccccaaccccaaccccaaccccaaccccaaccctaacccctaa
## >chr1:16155-16316
## CAGATACTCCCTGCTTCCTCTCTAGCCCCCACCCTGCAGAGCTGGACCCCTGAGCTAGCCATGCTCTGACAGTCTCAGTTGCACACACGAGCCAGCAGAGGGGTTTTGTGCCACTTCTGGATGCTAGGGTTACACTGGGAGACACAGCAGTGAAGCTGAAA
## >chr1:267904-268065
## CCCACATTATACAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGC

We have 8 motifs that found by MEME/STREME software.

ls individual_meme/
## AGATAAM_streme.txt
## GATAmotif1_meme.txt
## GATAmotif2_meme.txt
## GATAmotif3_meme.txt
## GATAmotif4_meme.txt
## GATAmotif5_meme.txt
## GATAmotif6_meme.txt
## TGATAA_streme.txt

MCF7_hg38_Encode_DHS_Rep1_161bp_peak.fasta

MEME (mast uses default p-value: 0.0001)

module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/individual_meme/

#round1
mast -hit_list -best ${dir}GATAmotif1_meme.txt MCF7_hg38_Encode_DHS_Rep1_161bp_peak.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round1.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round1.bed #5836
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round1.bed > MCF7DHS_Rep1_161bp_without_motifs_1.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_1.bed #120854

#round2
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_1.bed -fo MCF7DHS_Rep1_161bp_without_motifs_1.fasta
mast -hit_list -best ${dir}GATAmotif2_meme.txt MCF7DHS_Rep1_161bp_without_motifs_1.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round2.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round2.bed #4978
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_1.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round2.bed > MCF7DHS_Rep1_161bp_without_motifs_12.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_12.bed #115876

#round3
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_12.bed -fo MCF7DHS_Rep1_161bp_without_motifs_12.fasta
mast -hit_list -best ${dir}GATAmotif3_meme.txt MCF7DHS_Rep1_161bp_without_motifs_12.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round3.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round3.bed #2263
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_12.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round3.bed > MCF7DHS_Rep1_161bp_without_motifs_123.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_123.bed #113613

#round4
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_123.bed -fo MCF7DHS_Rep1_161bp_without_motifs_123.fasta
mast -hit_list -best ${dir}GATAmotif4_meme.txt MCF7DHS_Rep1_161bp_without_motifs_123.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round4.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round4.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round4.bed #3237
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_123.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round4.bed > MCF7DHS_Rep1_161bp_without_motifs_1234.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_1234.bed #110376

#round5
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_1234.bed -fo MCF7DHS_Rep1_161bp_without_motifs_1234.fasta
mast -hit_list -best ${dir}GATAmotif5_meme.txt MCF7DHS_Rep1_161bp_without_motifs_1234.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round5.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round5.bed #2629
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_1234.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round5.bed > MCF7DHS_Rep1_161bp_without_motifs_12345.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_12345.bed #107747


#round6
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_12345.bed -fo MCF7DHS_Rep1_161bp_without_motifs_12345.fasta
mast -hit_list -best ${dir}GATAmotif6_meme.txt MCF7DHS_Rep1_161bp_without_motifs_12345.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round6.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round6.bed #2800
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_12345.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round6.bed > MCF7DHS_Rep1_161bp_without_motifs_123456.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_123456.bed #104947

STREME (mast uses p-value of 0.0005)

#round7
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_123456.bed -fo MCF7DHS_Rep1_161bp_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ${dir}AGATAAM_streme.txt MCF7DHS_Rep1_161bp_without_motifs_123456.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round7.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round7.bed #8176
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_123456.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round7.bed > MCF7DHS_Rep1_161bp_without_motifs_123456_7.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_123456_7.bed #96771


#round8
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_without_motifs_123456_7.bed -fo MCF7DHS_Rep1_161bp_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ${dir}TGATAA_streme.txt MCF7DHS_Rep1_161bp_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round8.txt 
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round8.txt 
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round8.bed #4383
intersectBed -v -a MCF7DHS_Rep1_161bp_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round8.bed > MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed
wc -l MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed  #92388

MCF7_hg38_Encode_DHS_Rep2_161bp_peak.fasta

MEME (mast uses default p-value: 0.0001)

module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/individual_meme/

#round1
mast -hit_list -best ${dir}GATAmotif1_meme.txt MCF7_hg38_Encode_DHS_Rep2_161bp_peak.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round1.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round1.bed #5249
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round1.bed > MCF7DHS_Rep2_161bp_without_motifs_1.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_1.bed #167977

#round2
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_1.bed -fo MCF7DHS_Rep2_161bp_without_motifs_1.fasta
mast -hit_list -best ${dir}GATAmotif2_meme.txt MCF7DHS_Rep2_161bp_without_motifs_1.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round2.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round2.bed #4249
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_1.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round2.bed > MCF7DHS_Rep2_161bp_without_motifs_12.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_12.bed #163728

#round3
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_12.bed -fo MCF7DHS_Rep2_161bp_without_motifs_12.fasta
mast -hit_list -best ${dir}GATAmotif3_meme.txt MCF7DHS_Rep2_161bp_without_motifs_12.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round3.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round3.bed #1864
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_12.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round3.bed > MCF7DHS_Rep2_161bp_without_motifs_123.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_123.bed #161864

#round4
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_123.bed -fo MCF7DHS_Rep2_161bp_without_motifs_123.fasta
mast -hit_list -best ${dir}GATAmotif4_meme.txt MCF7DHS_Rep2_161bp_without_motifs_123.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round4.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round4.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round4.bed #3075
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_123.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round4.bed > MCF7DHS_Rep2_161bp_without_motifs_1234.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_1234.bed #158789

#round5
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_1234.bed -fo MCF7DHS_Rep2_161bp_without_motifs_1234.fasta
mast -hit_list -best ${dir}GATAmotif5_meme.txt MCF7DHS_Rep2_161bp_without_motifs_1234.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round5.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round5.bed #2510
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_1234.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round5.bed > MCF7DHS_Rep2_161bp_without_motifs_12345.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_12345.bed #156279


#round6
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_12345.bed -fo MCF7DHS_Rep2_161bp_without_motifs_12345.fasta
mast -hit_list -best ${dir}GATAmotif6_meme.txt MCF7DHS_Rep2_161bp_without_motifs_12345.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round6.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round6.bed #2689
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_12345.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round6.bed > MCF7DHS_Rep2_161bp_without_motifs_123456.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_123456.bed #153590

STREME (mast uses p-value of 0.0005)

#round7
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_123456.bed -fo MCF7DHS_Rep2_161bp_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ${dir}AGATAAM_streme.txt MCF7DHS_Rep2_161bp_without_motifs_123456.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round7.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round7.bed #8783
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_123456.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round7.bed > MCF7DHS_Rep2_161bp_without_motifs_123456_7.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_123456_7.bed #144807


#round8
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_without_motifs_123456_7.bed -fo MCF7DHS_Rep2_161bp_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ${dir}TGATAA_streme.txt MCF7DHS_Rep2_161bp_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round8.txt 
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round8.txt 
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round8.bed #4908
intersectBed -v -a MCF7DHS_Rep2_161bp_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round8.bed > MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed
wc -l MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed  #139899

We have discovered 8 motifs using MEME/STREME software. Previously, we concatenated them into a single motif matrix database file (“combined_output_meme.txt”). Now, we aim to perform a coherence check by using MAST to scan this concatenated file against the remaining regions.

#rep1
mast -hit_list -best combined_output_meme.txt MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_round9.txt 
#rep2
mast -hit_list -best combined_output_meme.txt MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_round9.txt 

Mast the concatenated file against the remaining regions do not give overlapped regions.

2.4 Independent control – remove GATA3 peak regions from the DHS regions

To make an independent control, we can remove all overlapped GATA3 ChIP-seq peak regions from the DHS regions. Then we will MAST against the remaining DHS regions, and see what is the random odds we could get GATA3 bindin region.

module load bedtools
sizes=/home/FCAM/ssun/Genome/hg38.chrom.sizes
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_call/

slopBed -b 80 -i ${dir}GATA_ChIP_summits_final.bed -g $sizes  | sort -k1,1 -k2,2n > GATA_ChIP_summit_161window.bed

#remove overlapped GATA3 peak regions from DHS regions
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed -b GATA_ChIP_summit_161window.bed > MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed -b GATA_ChIP_summit_161window.bed > MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed
wc -l MCF7_hg38_Encode_DHS_Rep1_161bp_peak.bed #126690
wc -l MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed #84362
wc -l MCF7_hg38_Encode_DHS_Rep2_161bp_peak.bed #173226
wc -l MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed #131896

convert to fasta.

module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa

fastaFromBed -fi $genome -bed MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed -fo MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.fasta
fastaFromBed -fi $genome -bed MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed -fo MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.fasta
head -3 MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed
head -6 MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.fasta
head -3 MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed
head -6 MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.fasta
## chr1 267964  268125
## chr1 586055  586216
## chr1 629895  630056
## >chr1:267964-268125
## GGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGCTACGTCCTCTCTTGAGCTACAGCAGATTCACTCTGTTCTGTTTCATTGTTGTTTAGTTTG
## >chr1:586055-586216
## TTTCCCACATTATTCAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCGGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGGGTGGGGGC
## >chr1:629895-630056
## TAACCAATACCACCAATCAATACTCATCATTAATAATCATAATGGCTATAGCAATAAAACTAGGAATAGCCCCCTTTCACTTCTGAGTCCCAGAGGTTACCCAAGGCACCCCTCTGACATCCGGCCTGCTCCTTCTCACATGACAAAAACTAGCCCCCATC
## chr1 10175   10336
## chr1 16155   16316
## chr1 267904  268065
## >chr1:10175-10336
## aacctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccctaaccctaaccctaaaccctaaaccctaaccctaaccctaaccctaaccctaaccccaaccccaaccccaaccccaaccccaaccccaaccctaacccctaa
## >chr1:16155-16316
## CAGATACTCCCTGCTTCCTCTCTAGCCCCCACCCTGCAGAGCTGGACCCCTGAGCTAGCCATGCTCTGACAGTCTCAGTTGCACACACGAGCCAGCAGAGGGGTTTTGTGCCACTTCTGGATGCTAGGGTTACACTGGGAGACACAGCAGTGAAGCTGAAA
## >chr1:267904-268065
## CCCACATTATACAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGC

MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.fasta

MEME (mast uses default p-value: 0.0001)

module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/individual_meme/

#round1
mast -hit_list -best ${dir}GATAmotif1_meme.txt MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round1.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round1.bed #1739
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep1_161bp_noGATA_peak.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round1.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.bed #82623

#round2
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.fasta
mast -hit_list -best ${dir}GATAmotif2_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round2.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round2.bed #1517
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_1.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round2.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.bed #81106

#round3
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.fasta
mast -hit_list -best ${dir}GATAmotif3_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round3.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round3.bed #802
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_12.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round3.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.bed #80304

#round4
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.fasta
mast -hit_list -best ${dir}GATAmotif4_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round4.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round4.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round4.bed #1479
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_123.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round4.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.bed #78825

#round5
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.fasta
mast -hit_list -best ${dir}GATAmotif5_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round5.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round5.bed #1219
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_1234.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round5.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.bed #77606


#round6
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.fasta
mast -hit_list -best ${dir}GATAmotif6_meme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round6.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round6.bed #1239
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_12345.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round6.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.bed #76367

STREME (mast uses p-value of 0.0005)

#round7
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ${dir}AGATAAM_streme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round7.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round7.bed #4663
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round7.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.bed #71704


#round8
fastaFromBed -fi $genome -bed MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.bed -fo MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ${dir}TGATAA_streme.txt MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round8.txt 
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round8.txt 
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round8.bed #2585
intersectBed -v -a MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep1_noGATA_round8.bed > MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed
wc -l MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed  #69119

MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.fasta

MEME (mast uses default p-value: 0.0001)

module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/individual_meme/

#round1
mast -hit_list -best ${dir}GATAmotif1_meme.txt MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round1.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round1.bed #1976
intersectBed -v -a MCF7_hg38_Encode_DHS_Rep2_161bp_noGATA_peak.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round1.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.bed #129920

#round2
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.fasta
mast -hit_list -best ${dir}GATAmotif2_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round2.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round2.bed #1556
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_1.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round2.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.bed #128364

#round3
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.fasta
mast -hit_list -best ${dir}GATAmotif3_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round3.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round3.bed #818
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_12.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round3.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.bed #127546

#round4
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.fasta
mast -hit_list -best ${dir}GATAmotif4_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round4.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round4.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round4.bed #1688
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_123.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round4.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.bed #125858

#round5
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.fasta
mast -hit_list -best ${dir}GATAmotif5_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round5.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round5.bed #1385
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_1234.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round5.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.bed #124473


#round6
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.fasta
mast -hit_list -best ${dir}GATAmotif6_meme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round6.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round6.bed #1443
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_12345.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round6.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.bed #123030

STREME (mast uses p-value of 0.0005)

#round7
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ${dir}AGATAAM_streme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round7.txt
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round7.bed #5859
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round7.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.bed #117171


#round8
fastaFromBed -fi $genome -bed MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.bed -fo MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ${dir}TGATAA_streme.txt MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round8.txt 
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round8.txt 
wc -l mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round8.bed #3397
intersectBed -v -a MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_MCF7_DHS_Rep2_noGATA_round8.bed > MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed
wc -l MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed  #113774

2.4.1 bar plot

bar plot (group by peak intensity)
The fraction of peaks with MEME-found-motif 123456, as well as STREME-found motif78, in each quantile:

cp /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/bar_plot/quantile.240103.sum.161window.txt .

We will add the two DHS file.

#MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed
#MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed

for i in MCF7DHS*_161bp_without_motifs_123456_78.bed
do
 nm=$(echo $i | awk -F"_161bp_without_motifs_123456_78.bed" '{print $1}')
 rep=$(echo $i | awk -F"_161bp_without_motifs_123456_78.bed" '{print $1}' | awk -F"MCF7DHS_" '{print $2}')
 p=$(wc -l $i | awk '{OFS="\t";} {print $1}')
 totalp=$(wc -l MCF7_hg38_Encode_DHS_${rep}_161bp_peak.bed | awk '{OFS="\t";} {print $1}')
 withoutMotif=$(bc <<< "scale=4 ; $p / $totalp")
 withMotif=$(bc <<< "scale=4 ; ($totalp- $p) / $totalp")
 echo $nm "" $withoutMotif "" $withMotif>> quantile.240103.sum.161window.txt
done
#MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed
#MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed

for i in MCF7*_161bp_noGATA_without_motifs_123456_78.bed
do
 nm=$(echo $i | awk -F"_161bp_noGATA_without_motifs_123456_78.bed" '{print $1}')
 rep=$(echo $i | awk -F"_161bp_noGATA_without_motifs_123456_78.bed" '{print $1}' | awk -F"MCF7DHS_" '{print $2}')
 p=$(wc -l $i | awk '{OFS="\t";} {print $1}')
 totalp=$(wc -l MCF7_hg38_Encode_DHS_${rep}_161bp_noGATA_peak.bed | awk '{OFS="\t";} {print $1}')
 withoutMotif=$(bc <<< "scale=4 ; $p / $totalp")
 withMotif=$(bc <<< "scale=4 ; ($totalp- $p) / $totalp")
 echo "${nm}_noGATA" "" $withoutMotif "" $withMotif>> quantile.240103.sum.161window.txt
done
module load R/4.1.2 
R 

library("lattice") 
library("reshape2")


df_sum=read.table('quantile.240103.sum.161window.txt', sep = "", header=FALSE)
colnames(df_sum)=c("quantile","without_Motif","with_Motif")
df_sum_long <- df_sum

df_sum_long <- melt(df_sum_long, id = "quantile")
df_sum_long$variable<- factor(df_sum_long$variable, levels = c("with_Motif", "without_Motif"))

pdf('240105_percent_peaks_withorwithout_GATA3_motif_variant.pdf',width=8,height=5)
my.settings <- list(
  superpose.polygon=list(col=c("black", "grey"), border="transparent"),
  strip.background=list(col="grey80", cex = 0.6),
  strip.border=list(col="black")
)
print(barchart(value ~ quantile,         
         data = df_sum_long,
         groups = variable,
         stack = TRUE,
         auto.key=list(space="right"),
         scales = list(x = list(rot = 45)),
         ylab = "peak set within intensity quantile",
         xlab = "fraction of peaks with/without GATA3 motif variant",
         par.settings = my.settings)
)
dev.off()
percent peaks with or without GATA3 motif (variant 1,2,3,4,5,6,7,8):
percent peaks with or without GATA3 motif variants

Figure 3: percent peaks with or without GATA3 motif variants

For peaks in the top quantile intensity (top 5% highest intensity), we have ~16% peaks MEME/STREME are not able to find GATA3 binding elements.

We have added full DHS control and an independent DHS control.

Analysis steps:
1) Retrieved MCF7 DHS data (rep1 & rep2) from GSE29692 and used UCSC liftover to convert the data from hg19 to hg38.
2) Modified the DHS regions to 161bp windows.
3) Full DHS control: Applied the same MAST stringency to eliminate regions that overlapped with the 8 MEME/STREME de novo GATA3-like motifs previously identified in GATA3 ChIP peaks.
4) DHS regions removed GATA3 peak regions: use intersectBed to remove overlapped GATA3 ChIP-seq peak regions from the DHS regions, then applied the same MAST stringency to eliminate regions that overlapped with the 8 MEME/STREME de novo GATA3-like motifs previously identified in GATA3 ChIP peaks.
5) Coherence check: Employed MAST on the concatenated motif database against the regions with removed GATA3 motifs —no further regions were found.
6) Counted the regions with and without GATA3 motifs and added them to the bar plot.

Conclusion:
The left four bars are the MCF7 DHS controls. From left to right, they are full rep1 DHS regions, rep1 DHS regions that removed overlapped GATA3 ChIP peak regions, full rep2 DHS regions and rep2 DHS regions that removed overlapped GATA3 ChIPpeak regions.
Among the GATA3 peaks in the lowest intensity quantile (quantile 0.05), over 45% of the peaks contain GATA3 binding elements as identified by MEME/STREME.
The full DHS regions show a 27% overlap with GATA3 motifs in the 1st replicate and a 19% overlap in the 2nd replicate.
The DHS regions removed overlapped GATA3 peak regions has 18% (rep1) and 13.7% (rep2) peaks with GATA3 motifs.

This indicates that randomly we would expect a 14~18% regions has GATA3 motifs found. And GATA3 binding is more specific to regions within the genome (represented by GATA3 ChIP-seq peaks) rather than being uniformly distributed across all accessible DNA regions (represented by the DHS regions).
Consequently, we can infer that even within the GATA3 peaks falling within the lowest quantile, there are significant and meaningful binding sites for GATA3.

2.5 GATA3 ChIP peak without MEME/STREME found GATA3 motif

Background: we have GATA3 peak that have exhaustively searched for GATA3-like motif with MEME/STREME software. However, there are ~38.5% peak failed in finding a GATA3-like motif. Even for peak subset that have relatively high peak intensity (top5%), we still have ~16% peaks that MEME/STREME cannot find a GATA3-like motif.

Generally, we would expect to find binding sequences within the peak region, so that the ChIPed transcription factor (in our case, GATA3) can bind to. This is even more true for peaks with relatively high intensity.

Our question now became, is MEME/STREME limited in finding binding sequences patterned like GATA3 (that has fixed spacings between two 3mer)? Can we find the 3mer-3mer sequences in GATA3 peaks that do not contain the MEME/STREME found GATA3-like motifs?

2.6 Process the negative control

2.6.1 MCF7 DHS

There are four DHS negative controls: two reps of full MCF7 DHS regions (refer as “full DHS control”), two reps of MCF7 DHS regions without GATA3 ChIP peak regions (From here, I will refer this sets of control as “independent DHS control”).
All DHS negative controls are removed of the 8 GATA3-like motifs, because we want to compare them to the GATA3 peaks that MEME/STREME failed to find GATA3-like de novo motifs.

  • “full DHS control”
MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed  #92388
MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed  #139899
  • “independent DHS control”
MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed  #69119
MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed  #113774

2.7 Process the positive control

There are 8 GATA3-like motifs found by MEME/STREME. Among them, 5 of them have GATA3 binding motifs that are formed by two GAT (or ATC) 3mer and various spacings.

I will utilize MAST against the 161bp window peaks for all 8 motifs (the concatenated database file) to identify motif coordinates corresponding to each motif. Subsequently, I’ll employ intersectBed to isolate the peak sets containing the best motif within the 161bp window. The peak sets characterized by the presence of the specific motif structures that we are insterested in will serve as positive controls.

#GATA_ChIP_summit_161window.bed #the 161bp window around peak summit from the full GATA3 peak
module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_call/
dir2=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/

fastaFromBed -fi $genome -bed ${dir}GATA_ChIP_summit_161window.bed -fo GATA_ChIP_summit_161window.fasta


mast -hit_list -best ${dir2}combined_output_meme.txt GATA_ChIP_summit_161window.fasta > mast_GATA3_PSWM_in_peaks_161win.txt 

wget https://raw.githubusercontent.com/sysunn/siyu_daily_update/main/December_2023/parse_multi_mast_to_coordinates.R
Rscript parse_multi_mast_to_coordinates.R mast_GATA3_PSWM_in_peaks_161win.txt  8
wc -l mast_GATA3_PSWM_in_peaks_161win.bed
head -3 mast_GATA3_PSWM_in_peaks_161win.bed
tail -3 mast_GATA3_PSWM_in_peaks_161win.bed
awk '{print > ("mast_GATA3_PSWM_in_peak_161win_motif_" $7 ".bed")}' mast_GATA3_PSWM_in_peaks_161win.bed
ls mast_GATA3_PSWM_in_peak_161win_motif*.bed
##    63124 mast_GATA3_PSWM_in_peaks_161win.bed
## chr1 1505049 1505056 1230.47 6.56e-05    +   1
## chr1 1883638 1883645 1230.47 6.56e-05    +   1
## chr1 5603508 5603515 1312.58 2.16e-05    -   1
## chrX 155296326   155296337   1713.86 1.94e-06    -   8
## chrY 786683  786694  1635.44 4.15e-06    -   8
## chrY 4437398 4437409 1439.4  1.66e-05    -   8
## mast_GATA3_PSWM_in_peak_161win_motif_1.bed
## mast_GATA3_PSWM_in_peak_161win_motif_2.bed
## mast_GATA3_PSWM_in_peak_161win_motif_3.bed
## mast_GATA3_PSWM_in_peak_161win_motif_4.bed
## mast_GATA3_PSWM_in_peak_161win_motif_5.bed
## mast_GATA3_PSWM_in_peak_161win_motif_7.bed
## mast_GATA3_PSWM_in_peak_161win_motif_8.bed

Notice that the motif indices here is based on the PWM order in the concatenated database file:
1: MOTIF AGATAARR –MEME-round3 motif3
2: MOTIF WGATBDHRVAGATAA –MEME-round6 motif6
3: MOTIF BTTATCWGATB –MEME-round5 motif5
4: MOTIF 4-AGATAAM. –STREME motif 1
5: MOTIF AGATNDWNAGATARN. –MEME-round4 motif4
6: MOTIF 6-TGATAA. –STREME motif 2
7: MOTIF WGATBTTATCW –MEME-round1 motif1
8: MOTIF WGATAARVATCW –MEME-round2 motif2

We don’t have peaks assigned with motif indice==6, that means previously we have a set of peak assigned with STREME motif2 (motif8) are 1) no longer defined as contain motifs, or 2) now assigned with other motifs.

  • round1 motif: GAT—ATC
#mast motif indice==7
intersectBed -wa -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peak_161win_motif_7.bed | sort -k1,1 -k2,2n | uniq > GATA3_peak_161win_with_motif_1.bed
wc -l GATA3_peak_161win_with_motif_1.bed #13273 
head -5 GATA3_peak_161win_with_motif_1.bed
#chr1   1080476 1080637 GATA_ChIP_peak_45   6.37157
#chr1   1238157 1238318 GATA_ChIP_peak_51   16.8938
#chr1   1686351 1686512 GATA_ChIP_peak_63   35.0463
#chr1   1741050 1741211 GATA_ChIP_peak_68   11.6965
#chr1   2001941 2002102 GATA_ChIP_peak_77   7.79887
  • round2 motif: GAT—-ATC
#mast motif indice==8
intersectBed -wa -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peak_161win_motif_8.bed | sort -k1,1 -k2,2n | uniq > GATA3_peak_161win_with_motif_2.bed
wc -l GATA3_peak_161win_with_motif_2.bed #12373 
head -5 GATA3_peak_161win_with_motif_2.bed
#chr1   996069  996230  GATA_ChIP_peak_35   38.3686
#chr1   3213798 3213959 GATA_ChIP_peak_87   118.864
#chr1   3831729 3831890 GATA_ChIP_peak_96   162.768
#chr1   5688567 5688728 GATA_ChIP_peak_109  93.8407
#chr1   5959750 5959911 GATA_ChIP_peak_110  65.9829
  • round4 motif: GAT—–GAT
#mast motif indice==5
intersectBed -wa -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peak_161win_motif_5.bed | sort -k1,1 -k2,2n | uniq > GATA3_peak_161win_with_motif_4.bed
wc -l GATA3_peak_161win_with_motif_4.bed #9180 
head -5 GATA3_peak_161win_with_motif_4.bed
#chr1   869417  869578  GATA_ChIP_peak_30   3.78065
#chr1   917454  917615  GATA_ChIP_peak_32   75.0238
#chr1   1859356 1859517 GATA_ChIP_peak_75b  62.1909
#chr1   3926997 3927158 GATA_ChIP_peak_98   14.2308
#chr1   4659185 4659346 GATA_ChIP_peak_102  54.7665
  • round5 motif: ATC-GAT
#mast motif indice==3
intersectBed -wa -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peak_161win_motif_3.bed | sort -k1,1 -k2,2n | uniq > GATA3_peak_161win_with_motif_5.bed
wc -l GATA3_peak_161win_with_motif_5.bed #6703
head -5 GATA3_peak_161win_with_motif_5.bed
#chr1   1858826 1858987 GATA_ChIP_peak_75a  32.8724
#chr1   3948515 3948676 GATA_ChIP_peak_99   53.5507
#chr1   4597226 4597387 GATA_ChIP_peak_100  8.5429
#chr1   6317519 6317680 GATA_ChIP_peak_118  14.2308
#chr1   7320212 7320373 GATA_ChIP_peak_146  40.0204
  • round6 motif: GAT——GAT
#mast motif indice==2
intersectBed -wa -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peak_161win_motif_2.bed | sort -k1,1 -k2,2n | uniq > GATA3_peak_161win_with_motif_6.bed
wc -l GATA3_peak_161win_with_motif_6.bed #9352
head -5 GATA3_peak_161win_with_motif_6.bed
#chr1   917454  917615  GATA_ChIP_peak_32   75.0238
#chr1   6719989 6720150 GATA_ChIP_peak_129  10.6294
#chr1   6800848 6801009 GATA_ChIP_peak_133  261.961
#chr1   6959568 6959729 GATA_ChIP_peak_137  253.515
#chr1   7266109 7266270 GATA_ChIP_peak_145  33.955

GATA3_peak_161win_with_motif_1.bed
GATA3_peak_161win_with_motif_2.bed
GATA3_peak_161win_with_motif_4.bed
GATA3_peak_161win_with_motif_5.bed
GATA3_peak_161win_with_motif_6.bed

2.7.1 coherence check –barchart

When using mast to assess all 8 motifs concurrently, the -best option selectively assigns each peak region with only the most significant motif hit from the list of motif sites.

Previously, STREME identified a set of peaks (n=4452) containing motif8 (STREME motif2), which are now either defined as not containing motifs or associated with motifs other than motif8. Illustrating the fraction of peaks assigned with alternative motifs in this peak set via a bar chart would offer a visual representation of this transition.

module load meme/5.4.1
module load R/4.1.2
module load bedtools
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/

#101 bp window
intersectBed -wa -a ${dir}without_motifs_123456_7.bed -b ${dir}mast_GATA3_PSWM_in_peaks_round8.bed |uniq > peak_101win_with_motif8.bed #4452
head peak_101win_with_motif8.bed
#chr1   1775793 1775894 GATA_ChIP_peak_72   24.5459
#chr1   1855841 1855942 GATA_ChIP_peak_74   3.19627
#chr1   1883543 1883644 GATA_ChIP_peak_76   36.5894
#chr1   2135560 2135661 GATA_ChIP_peak_79   18.7357
#chr1   3613537 3613638 GATA_ChIP_peak_90   27.5967

#161bp window
sizes=/home/FCAM/ssun/Genome/hg38.chrom.sizes
slopBed -b 30 -i peak_101win_with_motif8.bed -g $sizes  | sort -k1,1 -k2,2n > peak_161win_with_motif8.bed #4452

When we mast with motif database, we have generated a bed file “mast_GATA3_PSWM_in_peaks_161win.bed” contains motif coordinates info. The 7th column is the motif indices.

head -5 mast_GATA3_PSWM_in_peaks_161win.bed #63124
#chr1   1505049 1505056 1230.47 6.56e-05    +   1
#chr1   1883638 1883645 1230.47 6.56e-05    +   1
#chr1   5603508 5603515 1312.58 2.16e-05    -   1
#chr1   6795903 6795910 1312.58 2.16e-05    +   1
#chr1   7549563 7549570 1230.47 6.56e-05    +   1

We can use intersectBed to intersect these motif coordinates with the full GATA3 peaks (161window) and get all peaks that contains GATA3-like motifs.
Notice that there are peaks containing more than 1 motifs.

intersectBed -wa -wb -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peaks_161win.bed | sort -k1,1 -k2,2n | uniq | head -5
#chr1   869417  869578  GATA_ChIP_peak_30   3.78065 chr1    869461  869467  1284.91 9.58e-05    -   4
#chr1   869417  869578  GATA_ChIP_peak_30   3.78065 chr1    869468  869482  1223.57 5.03e-05    -   5
#chr1   917454  917615  GATA_ChIP_peak_32   75.0238 chr1    917504  917518  1111.37 7.37e-05    +   5
#chr1   917454  917615  GATA_ChIP_peak_32   75.0238 chr1    917551  917565  1076.4  8.4e-05 +   2
#chr1   996069  996230  GATA_ChIP_peak_35   38.3686 chr1    996149  996160  1419.8  1.97e-05    +   8

We can calculate the number of non-duplicated peaks that contain at least one motif from the motif database (all 8 motifs) defined by mast use a stringency of 0.0001.

intersectBed -wa -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peaks_161win.bed | sort -k1,1 -k2,2n | awk '{OFS="\t"} {print $4}' | uniq | wc -l #51122

Recall that in the Exhaustive MEME/STREME analysis, each round we mast with single motif PMW against peaks. And for the last two round of STREME, we have loosen the mast stringency to 0.005.
Therefore, we end up have more peaks defined as having motifs (n=59560) compare to here when we applied a global mast strincy of 0.0001 (n=51122).

By simple math, there are 8438 peaks previously defined as containing motifs now are no longer having motifs.
We can get this peak set:

#new set of peak do not contain motifs
intersectBed -v -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peaks_161win.bed | sort -k1,1 -k2,2n | uniq > without_motifs_all_161bp_mast_new.bed #45746
head -5 without_motifs_all_161bp_mast_new.bed
#chr1   827300  827461  GATA_ChIP_peak_28   9.30478
#chr1   845716  845877  GATA_ChIP_peak_29   541.83
#chr1   916689  916850  GATA_ChIP_peak_31   7.79887
#chr1   924773  924934  GATA_ChIP_peak_33   3.78065
#chr1   966573  966734  GATA_ChIP_peak_34   3.78065

#old set of peak do not contain motifs
dir2=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/
head -5 ${dir2}without_motifs_123456_78_161bp_mast.bed #37308 peaks previously defined as not containing motif
#chr1   827300  827461  GATA_ChIP_peak_28   9.30478
#chr1   916689  916850  GATA_ChIP_peak_31   7.79887
#chr1   924773  924934  GATA_ChIP_peak_33   3.78065
#chr1   966573  966734  GATA_ChIP_peak_34   3.78065
#chr1   999428  999589  GATA_ChIP_peak_36   2.11515

#old set of peak contain motifs
intersectBed -v -f 1 -a ${dir}GATA_ChIP_summit_161window.bed -b ${dir2}without_motifs_123456_78_161bp_mast.bed > with_motifs_all_161bp_mast_old.bed #59560 (==96868-37308)

# get peaks that previously contain motifs now no longer contain motifs by mast with db
awk 'FNR==NR{peaks[$4]; next} $4 in peaks' with_motifs_all_161bp_mast_old.bed without_motifs_all_161bp_mast_new.bed > peak_no_longer_contain_motifs.bed #8439

head -5 peak_no_longer_contain_motifs.bed
#chr1   845716  845877  GATA_ChIP_peak_29   541.83
#chr1   1746867 1747028 GATA_ChIP_peak_70   4.46672
#chr1   1775763 1775924 GATA_ChIP_peak_72   24.5459
#chr1   1855811 1855972 GATA_ChIP_peak_74   3.19627
#chr1   2135530 2135691 GATA_ChIP_peak_79   18.7357

For our previous defined peak set that contain motif8 (STREME motif2) (n=4452):

we expect 1) some peak no longer contain motifs:

intersectBed -wa -wb -f 1 -F 1 -a peak_no_longer_contain_motifs.bed -b peak_161win_with_motif8.bed > peak_161win_with_motif8_no_longer_contain_motifs.bed #3528
head -5 peak_161win_with_motif8_no_longer_contain_motifs.bed
#chr1   1775763 1775924 GATA_ChIP_peak_72   24.5459 chr1    1775763 1775924 GATA_ChIP_peak_72   24.5459
#chr1   1855811 1855972 GATA_ChIP_peak_74   3.19627 chr1    1855811 1855972 GATA_ChIP_peak_74   3.19627
#chr1   2135530 2135691 GATA_ChIP_peak_79   18.7357 chr1    2135530 2135691 GATA_ChIP_peak_79   18.7357
#chr1   3613507 3613668 GATA_ChIP_peak_90   27.5967 chr1    3613507 3613668 GATA_ChIP_peak_90   27.5967
#chr1   6359876 6360037 GATA_ChIP_peak_119  32.8724 chr1    6359876 6360037 GATA_ChIP_peak_119  32.8724


${dir2}without_motifs_123456_78_161bp_mast.bed | uniq > peak_no_longer_contain_motifs.bed #8416
head -5 peak_no_longer_contain_motifs.bed
#chr1   845716  845877  GATA_ChIP_peak_29   541.83
#chr1   1746867 1747028 GATA_ChIP_peak_70   4.46672
#chr1   1775763 1775924 GATA_ChIP_peak_72   24.5459
#chr1   1855811 1855972 GATA_ChIP_peak_74   3.19627
#chr1   2135530 2135691 GATA_ChIP_peak_79   18.7357

intersectBed -wa -wb -a peak_no_longer_contain_motifs.bed -b peak_161win_with_motif8.bed | uniq > peak_161win_with_motif8_no_longer_contain_motifs.bed #3525
#coherence check
#there are peaks assigned with more than 1 motif
awk '{OFS="\t"} {print $4}' peak_161win_with_motif8_no_longer_contain_motifs.bed | uniq | wc -l #3524

Previously we have defined 4452 peaks contain motif8, now we have 3528 peaks no longer defined as containing motifs.

  1. some peak now assigned with other motifs

To extract peak and motif information: we can use the peak indices and awk to filter the rows in GATA3_peak_161win_with_all_mast_motif.bed based on whether their peak index (in the fourth column) is present in peak_161win_with_motif8.bed (peaks that previously found to contain motif8).

# all peaks that contain newly assigned motifs by mast to db
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_call/
intersectBed -wa -wb -a ${dir}GATA_ChIP_summit_161window.bed -b mast_GATA3_PSWM_in_peaks_161win.bed | awk '{OFS="\t"} {print $1, $2, $3, $4, $12}' | uniq  > GATA3_peak_161win_with_all_mast_motif.bed #63127

awk 'FNR==NR{peaks[$4]; next} $4 in peaks' peak_161win_with_motif8.bed GATA3_peak_161win_with_all_mast_motif.bed > filtered_GATA3_peak_161win_with_all_mast_motif.bed #997
head -5 filtered_GATA3_peak_161win_with_all_mast_motif.bed
#chr1   1883513 1883674 GATA_ChIP_peak_76   1
#chr1   3831729 3831890 GATA_ChIP_peak_96   8
#chr1   4597226 4597387 GATA_ChIP_peak_100  3
#chr1   8875638 8875799 GATA_ChIP_peak_226  7
#chr1   10277773    10277934    GATA_ChIP_peak_290  5


#coherence check
#there are peaks assigned with more than 1 motif
awk '{OFS="\t"} {print $4}' filtered_GATA3_peak_161win_with_all_mast_motif.bed | uniq | wc -l #924

There are 924 unique peaks previously defined as containing motif8 (STREME motif2) now assigned with other motif(s) by mast.

#module load R/4.1.2 
#R 

library(lattice) 
bar=read.table(file = "bar.txt", header = T)
bar$dum.x="peak fraction"
bar$supp=factor(bar$supp, levels=c("new_motif","no_motif"))

pdf('240111_peaks_with_GATA3_motif8_change_by_new_mast.pdf', width=5,height=6)
my.settings <- list(
  #superpose.polygon=list(col=c(colorRampPalette(c("red","pink"))(6),colorRampPalette(c("blue","light blue"))(3), "light grey"), border="transparent"),
  superpose.polygon=list(col=c("blue","lightgrey"), border="transparent"),
  strip.background=list(col="grey80", cex = 0.6),
  strip.border=list(col="black")
)
print(barchart(fraction ~ dum.x,         
         data = bar,
         groups = supp,
         stack = TRUE,
         ylim=c(0,1),
         auto.key=list(space="right"),
         #scales = list(x = list(rot = 45)),
         ylab = "peak set previously contain motif8",
         xlab = "fration of peaks changed when mast with motif db",
         par.settings = my.settings)
)
dev.off()
peak_with_motif8 either no longer defined as containing motifs or assigned with new motifs when mast with database

Figure 4: peak_with_motif8 either no longer defined as containing motifs or assigned with new motifs when mast with database

# Read the BED file
bed_data <- read.table("filtered_GATA3_peak_161win_with_all_mast_motif.bed", sep="\t", header=FALSE, stringsAsFactors=FALSE)

# Aggregate motif indices for each peak
aggregated_motifs <- aggregate(bed_data$V5, by = list(bed_data$V4), FUN = function(x) paste(unique(x), collapse = ","))

aggregated_motifs$nmotif <- sapply(aggregated_motifs$x, function(x) length(unlist(strsplit(x, ","))))
colnames(aggregated_motifs)=c("peak", "motif_indices", "nmotif")

head(aggregated_motifs)
#                   peak motif_indices nmotif
#1    GATA_ChIP_peak_100             3      1
#2  GATA_ChIP_peak_10048             1      1
#3 GATA_ChIP_peak_10198b           7,3      2
#4  GATA_ChIP_peak_10245             8      1
#5  GATA_ChIP_peak_10360             7      1
#6 GATA_ChIP_peak_10760a             2      1

nrow(aggregated_motifs)
#[1] 924
length(unique(aggregated_motifs$peak))
#[1] 924
str(aggregated_motifs)
#'data.frame':  924 obs. of  3 variables:
# $ peak         : chr  "GATA_ChIP_peak_100" "GATA_ChIP_peak_10048" "GATA_ChIP_peak_10198b" "GATA_ChIP_peak_10245" ...
# $ motif_indices: chr  "3" "1" "7,3" "8" ...
# $ nmotif       : int  1 1 2 1 1 1 2 1 1 1 ...
library(lattice)
group_counts <- as.data.frame(table(aggregated_motifs$motif_indices))
names(group_counts) <- c('Group', 'Count')

summary(group_counts$Count)
#Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 #     1       1       2      22       4     193

group_counts$Group=factor(group_counts$Group, levels=c( "1", "2", "3", "4", "5",  "7", "8", "1,2", "1,4", "1,5", "1,8",  "2,1", "2,3",    
 "2,7", "2,8", "3,1", "3,2",  "3,4", "3,7", "3,8", "4,2","4,3", "4,5", "4,8", "5,1","5,2", "5,3","5,4", "5,7", "5,8", "7,1", "7,2","7,3", "7,4", "7,5", "7,8", "8,1",  "8,2", "8,3", "8,4", "8,5", "8,7" ))

      

# Create the bar plot
pdf('240110_peaks_with_GATA3_motif_variant.pdf',width=15,height=10)
#custom_labels <- seq(1, 4000, by = 20)
print(barchart(Count ~ Group,
         data= group_counts, 
         col = "skyblue", 
         ylim=c(0, 200),
         xlab = "motif_indices", 
         ylab = "Number of Peaks",
         main = "Number of Peaks contain the motif",
         scales = list(x = list(rot = 45)), #y = list(at = custom_labels)
         horizontal = FALSE)
      )
dev.off()
peak_with_motif6 assigned with new motifs when mast with database

Figure 5: peak_with_motif6 assigned with new motifs when mast with database

The peak sets that used to defined as containing motif8 when mast single motif (motif8) against peaks without motif12345, are now assigned with other motifs when mast against entire GATA3 peak sets with all 8 motifs.

Summary

Peaks that were previously identified to contain motif8 through mast with a single motif PWM are no longer associated with motif8 when mast is applied to the motif database encompassing all eight motifs. This shift arises from two primary factors.

Firstly, some peaks are now unassigned to any motif, attributed to the heightened stringency (0.0001) in the current analysis, as opposed to the less stringent criterion (0.0005) employed in the previous round.

Conversely, certain peaks are now affiliated with alternative motifs. This shift can be attributed to the improved statistical calculations facilitated by mast using the -best option, enhancing motif assignments beyond the singular focus on motif8.

2.7.2 coherence check2

Prepare and upload these files to UCSC Genome browser:
a BED for peak (subset, 924 peaks previously enriched with motif8 now have assigned with other motifs) coordinates (101bp or 161bp).

peak_101win_with_motif8.bed #4452
peak_161win_with_motif8.bed #4452

awk '{OFS="\t"} {print $1, $2, $3, $4}' filtered_GATA3_peak_161win_with_all_mast_motif.bed | uniq > peak_161win_with_motif8_with_new_motifs.bed #924


head -5 peak_161win_with_motif8_with_new_motifs.bed #-161bp
#chr1   1883513 1883674 GATA_ChIP_peak_76
#chr1   3831729 3831890 GATA_ChIP_peak_96
#chr1   4597226 4597387 GATA_ChIP_peak_100
#chr1   8875638 8875799 GATA_ChIP_peak_226
#chr1   10277773    10277934    GATA_ChIP_peak_290

a BED for each newly assigned motif coordinates in the 161/101 window in the browser.

# newly assigned motif coordinates for all peaks
7431 mast_GATA3_PSWM_in_peak_161win_motif_1.bed
   9352 mast_GATA3_PSWM_in_peak_161win_motif_2.bed
   6702 mast_GATA3_PSWM_in_peak_161win_motif_3.bed
   4815 mast_GATA3_PSWM_in_peak_161win_motif_4.bed
   9180 mast_GATA3_PSWM_in_peak_161win_motif_5.bed
  13271 mast_GATA3_PSWM_in_peak_161win_motif_7.bed
  12373 mast_GATA3_PSWM_in_peak_161win_motif_8.bed

#intersect with the 924 peaks previously contain motif8

intersectBed -wa -wb -a mast_GATA3_PSWM_in_peak_161win_motif_1.bed -b peak_161win_with_motif8_with_new_motifs.bed > motif_1_PSWM_in_peak_161win_previous_with_motif8.bed #104
intersectBed -wa -wb -a mast_GATA3_PSWM_in_peak_161win_motif_2.bed -b peak_161win_with_motif8_with_new_motifs.bed > motif_2_PSWM_in_peak_161win_previous_with_motif8.bed #173
intersectBed -wa -wb -a mast_GATA3_PSWM_in_peak_161win_motif_3.bed -b peak_161win_with_motif8_with_new_motifs.bed > motif_3_PSWM_in_peak_161win_previous_with_motif8.bed #104
intersectBed -wa -wb -a mast_GATA3_PSWM_in_peak_161win_motif_4.bed -b peak_161win_with_motif8_with_new_motifs.bed > motif_4_PSWM_in_peak_161win_previous_with_motif8.bed #67
intersectBed -wa -wb -a mast_GATA3_PSWM_in_peak_161win_motif_5.bed -b peak_161win_with_motif8_with_new_motifs.bed > motif_5_PSWM_in_peak_161win_previous_with_motif8.bed #130
intersectBed -wa -wb -a mast_GATA3_PSWM_in_peak_161win_motif_7.bed -b peak_161win_with_motif8_with_new_motifs.bed > motif_7_PSWM_in_peak_161win_previous_with_motif8.bed #196
intersectBed -wa -wb -a mast_GATA3_PSWM_in_peak_161win_motif_8.bed -b peak_161win_with_motif8_with_new_motifs.bed > motif_8_PSWM_in_peak_161win_previous_with_motif8.bed #224

head -5 motif_1_PSWM_in_peak_161win_previous_with_motif8.bed
#chr1   1883638 1883645 1230.47 6.56e-05    +   1   chr1    1883513 1883674 GATA_ChIP_peak_76
#chr1   20232820    20232827    1230.47 6.56e-05    -   1   chr1    20232679    20232840    GATA_ChIP_peak_603
#chr1   37484682    37484689    1230.47 6.56e-05    -   1   chr1    37484679    37484840    GATA_ChIP_peak_1230
#chr1   89045539    89045546    1230.47 6.56e-05    +   1   chr1    89045405    89045566    GATA_ChIP_peak_2681
#chr1   95084738    95084745    1312.58 2.16e-05    -   1   chr1    95084600    95084761    GATA_ChIP_peak_2867

a BED for the previous motif8 coordinates in 101 window in the browser.

dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8/

#101 bp window
intersectBed -wa -wb -a ${dir}mast_GATA3_PSWM_in_peaks_round8.bed -b ${dir}without_motifs_123456_7.bed | uniq  > old_mast_motif_8_PSWM_in_peak_101win.bed #4454

head -5 old_mast_motif_8_PSWM_in_peak_101win.bed
#chr1   1775856 1775861 1142.66 0.000349    +   1   chr1    1775793 1775894 GATA_ChIP_peak_72   24.5459
#chr1   1855920 1855925 1142.66 0.000349    +   1   chr1    1855841 1855942 GATA_ChIP_peak_74   3.19627
#chr1   1883563 1883568 1142.66 0.000349    -   1   chr1    1883543 1883644 GATA_ChIP_peak_76   36.5894
#chr1   2135607 2135612 1142.66 0.000349    -   1   chr1    2135560 2135661 GATA_ChIP_peak_79   18.7357
#chr1   3613592 3613597 1142.66 0.000349    -   1   chr1    3613537 3613638 GATA_ChIP_peak_90   27.5967

awk 'FNR==NR{peaks[$4]; next} $11 in peaks' peak_161win_with_motif8_with_new_motifs.bed old_mast_motif_8_PSWM_in_peak_101win.bed > old_mast_motif_8_PSWM_in_peak_101win_with_new_motifs.bed #925

head -5 old_mast_motif_8_PSWM_in_peak_101win_with_new_motifs.bed
#chr1   1883563 1883568 1142.66 0.000349    -   1   chr1    1883543 1883644 GATA_ChIP_peak_76   36.5894
#chr1   3831836 3831841 1142.66 0.000349    -   1   chr1    3831759 3831860 GATA_ChIP_peak_96   162.768
#chr1   4597295 4597300 1142.66 0.000349    -   1   chr1    4597256 4597357 GATA_ChIP_peak_100  8.5429
#chr1   8875760 8875765 1142.66 0.000349    +   1   chr1    8875668 8875769 GATA_ChIP_peak_226  8.89252
#chr1   10277867    10277872    1142.66 0.000349    -   1   chr1    10277803    10277904    GATA_ChIP_peak_290  5.68898

Add trackline:

#peak region
peak_101win_with_motif8.bed #4452
peak_161win_with_motif8.bed #4452
peak_161win_with_motif8_with_new_motifs.bed #924

#motifs-old
old_mast_motif_8_PSWM_in_peak_101win_with_new_motifs.bed #925

#motifs-new
motif_1_PSWM_in_peak_161win_previous_with_motif8.bed
motif_2_PSWM_in_peak_161win_previous_with_motif8.bed
motif_3_PSWM_in_peak_161win_previous_with_motif8.bed
motif_4_PSWM_in_peak_161win_previous_with_motif8.bed
motif_5_PSWM_in_peak_161win_previous_with_motif8.bed
motif_7_PSWM_in_peak_161win_previous_with_motif8.bed
motif_8_PSWM_in_peak_161win_previous_with_motif8.bed

awk 'BEGIN {print "browser position chr10:16,000-17,000" 
            print "track type=bed name=\"peak_101win_with_motif8.bed\" description=\"peak_101win_full\" visibility=full autoScale=on useScore=1 color=0,0,0"
            } {print $0}' peak_101win_with_motif8.bed > peak_101win_with_motif8.header.bed
            
awk 'BEGIN {print "browser position chr10:16,000-17,000" 
            print "track type=bed name=\"peak_161win_with_motif8.bed\" description=\"peak_161win_full\" visibility=full autoScale=on useScore=1 color=0,0,0"
            } {print $0}' peak_161win_with_motif8.bed > peak_161win_with_motif8.header.bed

awk 'BEGIN {print "browser position chr10:16,000-17,000" 
            print "track type=bed name=\"peak_161win_with_motif8_new.bed\" description=\"peak_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,0"
            } {print $0}' peak_161win_with_motif8_with_new_motifs.bed > peak_161win_with_motif8_with_new_motifs.header.bed
           



awk 'BEGIN {print "browser position chr10:16,000-17,000" 
            print "track type=bed name=\"old_mast_motif_8_PSWM_in_peak_101win_with_new_motifs.bed\" description=\"motif8_101win_old\" visibility=full autoScale=on useScore=1 color=255,0,0"
            } {print $1, $2, $3}' old_mast_motif_8_PSWM_in_peak_101win_with_new_motifs.bed > old_mast_motif_8_PSWM_in_peak_101win_with_new_motifs.header.bed
           




awk 'BEGIN {print "browser position chr10:16,000-17,000" 
            print "track type=bed name=\"motif_1_PSWM_in_peak_161win_previous_with_motif8.bed\" description=\"motif1_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,255"
            } {print $1, $2, $3}' motif_1_PSWM_in_peak_161win_previous_with_motif8.bed > motif_1_PSWM_in_peak_161win_previous_with_motif8.header.bed
          
awk 'BEGIN {print "browser position chr10:16,000-17,000" 
            print "track type=bed name=\"motif_2_PSWM_in_peak_161win_previous_with_motif8.bed\" description=\"motif2_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,255"
            } {print $1, $2, $3}' motif_2_PSWM_in_peak_161win_previous_with_motif8.bed > motif_2_PSWM_in_peak_161win_previous_with_motif8.header.bed

awk 'BEGIN {print "browser position chr10:16,000-17,000" 
            print "track type=bed name=\"motif_3_PSWM_in_peak_161win_previous_with_motif8.bed\" description=\"motif3_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,255"
            } {print $1, $2, $3}' motif_3_PSWM_in_peak_161win_previous_with_motif8.bed > motif_3_PSWM_in_peak_161win_previous_with_motif8.header.bed

awk 'BEGIN {print "browser position chr10:16,000-17,000" 
            print "track type=bed name=\"motif_4_PSWM_in_peak_161win_previous_with_motif8.bed\" description=\"motif4_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,255"
            } {print $1, $2, $3}' motif_4_PSWM_in_peak_161win_previous_with_motif8.bed > motif_4_PSWM_in_peak_161win_previous_with_motif8.header.bed
            
            
awk 'BEGIN {print "browser position chr10:16,000-17,000" 
            print "track type=bed name=\"motif_5_PSWM_in_peak_161win_previous_with_motif8.bed\" description=\"motif5_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,255"
            } {print $1, $2, $3}' motif_5_PSWM_in_peak_161win_previous_with_motif8.bed > motif_5_PSWM_in_peak_161win_previous_with_motif8.header.bed
            
            
            
awk 'BEGIN {print "browser position chr10:16,000-17,000" 
            print "track type=bed name=\"motif_7_PSWM_in_peak_161win_previous_with_motif8.bed\" description=\"motif7_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,255"
            } {print $1, $2, $3}' motif_7_PSWM_in_peak_161win_previous_with_motif8.bed > motif_7_PSWM_in_peak_161win_previous_with_motif8.header.bed
            
            
awk 'BEGIN {print "browser position chr10:16,000-17,000" 
            print "track type=bed name=\"motif_8_PSWM_in_peak_161win_previous_with_motif8.bed\" description=\"motif8_161win_new\" visibility=full autoScale=on useScore=1 color=0,0,255"
            } {print $1, $2, $3}' motif_8_PSWM_in_peak_161win_previous_with_motif8.bed > motif_8_PSWM_in_peak_161win_previous_with_motif8.header.bed

calculate the distance between old motif8 with newly assigned motifs
anchor at old motif coordinates summit, then calculate the distance from old motif coordinates to newly assigned motif coordinates in each peak region.

3 GAT-GAT/ATC 3mer analaysis

  1. get all peaks (161bp window) without the 8 MEME/STREME motifs. Parse the peaks to 4 quantile based on peak intensity (use deseq2).

  2. for each quantile, as well as the two negative control (MCF7 DHS rep1, rep2), get the closest GAT to peak summit.

  3. find the second closest GAT.
    Make sure that the we want to use the relative distance between the two zinc fingers.
    and orientation.

3.1 Parse peak based on Intensity

peaks (161bp window) without 8 GATA3-motif (that found by MEME/STREME)

#cd /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Exhaustive_MEME_MAST_round1to8
wc -l without_motifs_123456_78_161bp_mast.bed #37308 
head -5 without_motifs_123456_78_161bp_mast.bed
#chr1   827300  827461  GATA_ChIP_peak_28   9.30478
#chr1   916689  916850  GATA_ChIP_peak_31   7.79887
#chr1   924773  924934  GATA_ChIP_peak_33   3.78065
#chr1   966573  966734  GATA_ChIP_peak_34   3.78065
#chr1   999428  999589  GATA_ChIP_peak_36   2.11515

3.1.1 Peak Intensity - bigWig and DEseq2

In the above file, the last column is the peak intensity reported by MACS3.
Notice that a peak with high MACS3-intensity is not necessarily a intense peak. There are two things we need to consider regarding peak intensities. The first is peak region; the second is dynamic range. Imaging a peak that is very intense but within a narrow range, the other peak is not so intense but can span its signals across a very long distance. We need to consider this region differences while calling any peak to be intense or not.

First use Sam Flag 0x3(reads paired and mapped in proper pair) to calculate the read depth for each GATA library.

# calculate the size factors 
module load samtools/1.12

dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/sorted.bam_final/
for i in GATA
do
  echo $i
  > ${i}_header.txt
  > ${i}_reads.txt
  for j in ${dir}MCF7_dTAGGATA522*_${i}_*.sorted.bam
  do
    echo $j
    name=$(echo $j | awk -F $dir '{print $2}' | awk -F".sorted.bam" '{print $1}')
    echo $name | paste ${i}_header.txt - > ${i}_tmp.txt 
    mv ${i}_tmp.txt ${i}_header.txt
    reads=`samtools view -c -f 0x3 $j` #count the reads paired and mapped in proper pair
    echo $reads | paste ${i}_reads.txt - > ${i}_tmp.txt 
    mv ${i}_tmp.txt ${i}_reads.txt
  done  
  cat ${i}_header.txt ${i}_reads.txt > ${i}_tmp.txt
  mv ${i}_tmp.txt ${i}_reads.txt
  rm ${i}_header.txt
done 
cat GATA_reads.txt
##  MCF7_dTAGGATA522_GATA_CC_rep1   MCF7_dTAGGATA522_GATA_CC_rep2   MCF7_dTAGGATA522_GATA_CC_rep3   MCF7_dTAGGATA522_GATA_CE_rep1   MCF7_dTAGGATA522_GATA_CE_rep2   MCF7_dTAGGATA522_GATA_CE_rep3   MCF7_dTAGGATA522_GATA_dE_rep1   MCF7_dTAGGATA522_GATA_dE_rep2   MCF7_dTAGGATA522_GATA_dE_rep3
##  33948616    32585396    34475586    51112588    147834968   136838760   34136142    136271358   85665512

bargraph

library(lattice)
df=as.data.frame(t(read.table("GATA_reads.txt", header=F)))
colnames(df)=c("library", "aligned_reads")
df$libraey=as.factor(df$library)
df$aligned_reads=as.numeric((df$aligned_reads))

barchart(aligned_reads ~ library, 
         data = df,
         ylim=c(0, max(df$aligned_reads)*1.04),
         col = "skyblue",
         scales = list(x = list(rot = 45)),
         xlab = "GATA3 ChIP library", 
         ylab = "concordantly aligned reads"
         )

This bar graph represents the read depth in each GATA3 ChIP-seq library. In the downstream analysis, we will need to use DESeq2 to normalize the counts in each library with size factor to account for this read depth difference.  

Get peak intensity within 400bp window withDeseq2 for peaks without MEME/STREME found motifs 12345678.

load peaks without motif 12345678.

#module load R/4.1.2
#R
a =  read.table('without_motifs_123456_78_161bp_mast.bed', sep = "\t", header=FALSE) 
nrow(a) #37308
## [1] 37308
head(a)
##     V1      V2      V3                V4      V5
## 1 chr1  827300  827461 GATA_ChIP_peak_28 9.30478
## 2 chr1  916689  916850 GATA_ChIP_peak_31 7.79887
## 3 chr1  924773  924934 GATA_ChIP_peak_33 3.78065
## 4 chr1  966573  966734 GATA_ChIP_peak_34 3.78065
## 5 chr1  999428  999589 GATA_ChIP_peak_36 2.11515
## 6 chr1 1000456 1000617 GATA_ChIP_peak_37 5.52004

Increase the width to 400bp window.

library(bigWig)
peak.region.400win=center.bed(a, upstreamWindow = 200, downstreamWindow = 200)
nrow(peak.region.400win)
## [1] 37308
head(peak.region.400win)
##     V1      V2      V3                V4      V5
## 1 chr1  827180  827581 GATA_ChIP_peak_28 9.30478
## 2 chr1  916569  916970 GATA_ChIP_peak_31 7.79887
## 3 chr1  924653  925054 GATA_ChIP_peak_33 3.78065
## 4 chr1  966453  966854 GATA_ChIP_peak_34 3.78065
## 5 chr1  999308  999709 GATA_ChIP_peak_36 2.11515
## 6 chr1 1000336 1000737 GATA_ChIP_peak_37 5.52004

In the below chunk, we define a function to get raw counts from each sample/reps. This function uses bed.region.bpQuery.bigWig from bigWig package. It requires a bed region file that has the peak region that we want to analysis (right now we want to analyse peak without motifs 123456789); it also requires non-normalized, raw bigWig files (generated directly from seqOutbias) to allow bed.region.bpQuery.bigWig to query the counts information from. The output from this function will be: rows will be the same as bed file, columns will be each bigWig library, entries will be the raw counts.

#functions on github
source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')

#function
get.counts.interval <- function(df, path.to.bigWig, file.prefix = 'H') {
    vec.names = c()
    inten.df=data.frame(matrix(ncol = 0, nrow = nrow(df)))
    
    for (mod.bigWig in Sys.glob(file.path(path.to.bigWig, paste(file.prefix, "*.bigWig", sep ='')))) {
        factor.name = strsplit(strsplit(mod.bigWig, "/")[[1]][length(strsplit(mod.bigWig, "/")[[1]])], '.bigWig')[[1]][1]
        print(factor.name)
        vec.names = c(vec.names, factor.name)
        loaded.bw = load.bigWig(mod.bigWig)
        print(mod.bigWig)
        mod.inten = bed.region.bpQuery.bigWig(loaded.bw, df, abs.value = TRUE)
        inten.df = cbind(inten.df, mod.inten)
    }
    colnames(inten.df) = vec.names
    r.names = paste(df[,1], ':', df[,2], '-', df[,3], sep='')
    row.names(inten.df) = r.names
    return(inten.df)
}

We will first use the defined function to query raw counts from each non-normalized bigWig files of GATA ChIP-seq use peak region info loaded before (data frame “peak.region.400win”).

library(bigWig)
#non-normalized counts
GATA.counts.df= get.counts.interval(peak.region.400win, "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/bigWigs/Seqoutbias_bw","MCF") #25 libraries
#nrow(peak.region.400win)
#[1] 37308
#nrow(GATA.counts.df)
#[1] 37308
#head(GATA.counts.df)
#colnames(GATA.counts.df)


GATA.analysis.regions=GATA.counts.df[,grepl("_GATA_",colnames(GATA.counts.df))] # get non-normalized counts from "GATA" libraries
#colnames(GATA.analysis.regions)
#[1] "MCF7_dTAGGATA522_GATA_CC_rep1" "MCF7_dTAGGATA522_GATA_CC_rep2"
#[3] "MCF7_dTAGGATA522_GATA_CC_rep3" "MCF7_dTAGGATA522_GATA_CE_rep1"
#[5] "MCF7_dTAGGATA522_GATA_CE_rep2" "MCF7_dTAGGATA522_GATA_CE_rep3"
#[7] "MCF7_dTAGGATA522_GATA_dE_rep1" "MCF7_dTAGGATA522_GATA_dE_rep2"
#[9] "MCF7_dTAGGATA522_GATA_dE_rep3"
identical(rownames(GATA.analysis.regions),rownames(GATA.counts.df)) # [1] TRUE

Then we use the DESeq2 package to make a counts matrix (DESeqDataSetFromMatrix) and calculate size factors for each library (estimateSizeFactorsForMatrix) use the previously calculated read depth for each library (“GATA_reads.txt”); We use this size factor to normalize the counts (sizeFactors).
Then we use rowMeans to average the normalized counts for the three GATA_CC reps, and save it as “peak.intensities”; this normalized counts now can be use to determine if a peak is intense or not.

library(DESeq2)
sample.conditions = factor(sapply(strsplit(colnames(GATA.analysis.regions), '_rep'), '[', 1))
#[1] MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC
#[4] MCF7_dTAGGATA522_GATA_CE MCF7_dTAGGATA522_GATA_CE MCF7_dTAGGATA522_GATA_CE
#[7] MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE
#3 Levels: MCF7_dTAGGATA522_GATA_CC ... MCF7_dTAGGATA522_GATA_dE
deseq.counts.table = DESeqDataSetFromMatrix(countData = GATA.analysis.regions, # DESeqDataSet needs countData to be non-negative integers; non-normalized counts are integer, normalized signals has decimals.
                colData = as.data.frame(sample.conditions),
                design = ~ sample.conditions)


GATA.SF <- read.table("GATA_reads.txt", sep = '\t', header = TRUE)[,-1] # GATA size factors from read depth
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2
#                      33948616                      32585396
#  MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_CE_rep1
#                      34475586                      51112588
#  MCF7_dTAGGATA522_GATA_CE_rep2 MCF7_dTAGGATA522_GATA_CE_rep3
#                     147834968                     136838760
#  MCF7_dTAGGATA522_GATA_dE_rep1 MCF7_dTAGGATA522_GATA_dE_rep2
#                      34136142                     136271358
#  MCF7_dTAGGATA522_GATA_dE_rep3
#                      85665512
GATA.size.factors = estimateSizeFactorsForMatrix(GATA.SF) # read depth transformed to size factor 
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2 
#                    0.5385594                     0.5169334 
#MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_CE_rep1 
#                    0.5469193                     0.8108480 
#MCF7_dTAGGATA522_GATA_CE_rep2 MCF7_dTAGGATA522_GATA_CE_rep3 
#                    2.3452478                     2.1708044 
#MCF7_dTAGGATA522_GATA_dE_rep1 MCF7_dTAGGATA522_GATA_dE_rep2 
#                    0.5415343                     2.1618032 
#MCF7_dTAGGATA522_GATA_dE_rep3 
#                    1.3589941

                                       
sizeFactors(deseq.counts.table) <- GATA.size.factors # assign to each column of the count matrix (deseq.counts.table) the size factor to bring each column to a common scale
dds <- DESeq(deseq.counts.table)
normalized.counts.GATA3 = counts(dds, normalized=TRUE)
head(normalized.counts.GATA3)
peak.intensities = rowMeans(normalized.counts.GATA3[,1:3]) # we want to get the average read counts for CC groups
names(peak.intensities) = rownames(normalized.counts.GATA3)
save.image('240108_GATA3_ChIP_deseq.Rdata') 

Subset peaks without the 8 GATA3-motifs, each with 20% intensity quantile

We can load the saved Rdata and look at each dataframe.

module load R/4.1.2 
R 
load('240108_GATA3_ChIP_deseq.Rdata')

Read this .bed file into R, and use DeSeq2 to count read size and parse into different quantile

head(peak.intensities)
#chr1:827180-827581   chr1:916569-916970   chr1:924653-925054 
#            45.32469             36.69993             30.44684 
#  chr1:966453-966854   chr1:999308-999709 chr1:1000336-1000737 
#            26.87016             21.22455             24.34263 
length(peak.intensities)
#[1] 37308
quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))
# 20%        40%        60%        80%       100% 
#  31.78709   40.05304   51.95512   80.03307 6270.21535 

violin plot

This violin plot is showing the distribution of peak intensity (log10).

library(lattice)
log10quantiles <- quantile(log(abs(peak.intensities), base = 10), probs = seq(.20, 1.00, by = .20))

png('violinplot_GATA3_ChIP_peak_normalized_intensity.png')
print(bwplot(log(abs(peak.intensities), base = 10) ~ factor("1"), 
       main = "Violin-Like Plot",
       panel = function(x, ...) {
         panel.violin(x, ...)
         panel.abline(h = log10quantiles, col = "red", lty = 2)
       },
       xlab = "", ylab = "log10 normalized Intensity")
)
dev.off()
normalized GATA3 ChIP peak intensity

Figure 6: normalized GATA3 ChIP peak intensity

This violin plot represents the distribution and probability density of (log10) normalized peak intensity (for peak without motif 1~8). Each red dotted line indicates the quantile cutoff of 20%, which can help us to visualize the peak intensity distribution across different quantile.  

Parse peaks into 5 intensity quantile by 20%.

chr = sapply(strsplit(names(peak.intensities), ":"), "[", 1)
rnge = sapply(strsplit(names(peak.intensities), ":"), "[", 2)
start = as.numeric(sapply(strsplit(rnge, "-"), "[", 1)) + 200
end = as.numeric(sapply(strsplit(rnge, "-"), "[", 2)) - 200

quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))



# 1bp summit quantile file
j =0 
q=seq(.20, 1.00, by = .20)
count=0
for (i in quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))){
count = count +1

write.table(file = paste0('quantile', as.character(q[count]), '_summits.bed'), data.frame(chr[peak.intensities > j & peak.intensities <= i], start[peak.intensities > j & peak.intensities <= i], end[peak.intensities > j & peak.intensities <= i], peak.intensities[peak.intensities > j & peak.intensities <= i]), sep = '\t', quote=FALSE, col.names=FALSE, row.names=FALSE )
j = i
}
for i in quantile*.bed
do
wc -l $i
done
##     7269 quantile0.2_summits.bed
##     7461 quantile0.4_summits.bed
##     7462 quantile0.6_summits.bed
##     7461 quantile0.8_summits.bed
##     7462 quantile1_summits.bed

3.1.2 Coherence check

See if the number of input samples will change the normalized reads.
I am loading fewer samples (only two condition, GATA_CC and GATA_dE)

GATA.analysis.regions2=GATA.analysis.regions[,c(1,2,3,7,8,9)]
colnames(GATA.analysis.regions2)
#[1] "MCF7_dTAGGATA522_GATA_CC_rep1" "MCF7_dTAGGATA522_GATA_CC_rep2"
#[3] "MCF7_dTAGGATA522_GATA_CC_rep3" "MCF7_dTAGGATA522_GATA_dE_rep1"
#[5] "MCF7_dTAGGATA522_GATA_dE_rep2" "MCF7_dTAGGATA522_GATA_dE_rep3"
library(DESeq2)
sample.conditions2 = factor(sapply(strsplit(colnames(GATA.analysis.regions2), '_rep'), '[', 1))
sample.conditions2
#[1] MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC
#[4] MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE
#Levels: MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_dE

deseq.counts.table2 = DESeqDataSetFromMatrix(countData = GATA.analysis.regions2, # DESeqDataSet needs countData to be non-negative integers; non-normalized counts are integer, normalized signals has decimals.
                colData = as.data.frame(sample.conditions2),
                design = ~ sample.conditions2)


GATA.SF2 <- read.table("GATA_reads.txt", sep = '\t', header = TRUE)[,-1][,c(1,2,3,7,8,9)]
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2
#                      33948616                      32585396
#  MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_dE_rep1
#                      34475586                      34136142
#  MCF7_dTAGGATA522_GATA_dE_rep2 MCF7_dTAGGATA522_GATA_dE_rep3
#                     136271358                      85665512
GATA.size.factors2 = estimateSizeFactorsForMatrix(GATA.SF2)
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2 
#                    0.6821163                     0.6547257 
#MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_dE_rep1 
#                    0.6927045                     0.6858842 
#MCF7_dTAGGATA522_GATA_dE_rep2 MCF7_dTAGGATA522_GATA_dE_rep3 
#                    2.7380474                     1.7212438 

sizeFactors(deseq.counts.table2) <- GATA.size.factors2 # assign to each column of the count matrix (deseq.counts.table) the size factor to bring each column to a common scale
dds <- DESeq(deseq.counts.table2)
normalized.counts.GATA3.2 = counts(dds, normalized=TRUE)
head(normalized.counts.GATA3.2)
peak.intensities2 = rowMeans(normalized.counts.GATA3.2[,1:3])

head(peak.intensities)
#  chr1:827180-827581   chr1:916569-916970   chr1:924653-925054 
#            45.32469             36.69993             30.44684 
#  chr1:966453-966854   chr1:999308-999709 chr1:1000336-1000737 
#            26.87016             21.22455             24.34263 
head(peak.intensities2)
#  chr1:827180-827581   chr1:916569-916970   chr1:924653-925054 
#            35.78574             28.97613             24.03906 
#  chr1:966453-966854   chr1:999308-999709 chr1:1000336-1000737 
#            21.21512             16.75767             19.21953 

save.image('240108_GATA3_ChIP_deseq_test.Rdata') 

The alteration in input samples affected the size factor determined by DESeq2’s estimateSizeFactorsForMatrix function, consequently impacting the normalized intensity of reads for each sample. However, I don’t believe it is altering the relative differences within each sample—such as which peak exhibits the highest intensity and which exhibits less.

Coherence check-follow up:

module load R/4.1.2 
R 
load('240108_GATA3_ChIP_deseq_test.Rdata')
chr = sapply(strsplit(names(peak.intensities2), ":"), "[", 1)
rnge = sapply(strsplit(names(peak.intensities2), ":"), "[", 2)
start = as.numeric(sapply(strsplit(rnge, "-"), "[", 1)) + 200
end = as.numeric(sapply(strsplit(rnge, "-"), "[", 2)) - 200

quantile(peak.intensities2, probs = seq(.20, 1.00, by = .20))



# 1bp summit quantile file
j =0 
q=seq(.20, 1.00, by = .20)
count=0
for (i in quantile(peak.intensities2, probs = seq(.20, 1.00, by = .20))){
count = count +1

write.table(file = paste0('test_quantile', as.character(q[count]), '_summits.bed'), data.frame(chr[peak.intensities2 > j & peak.intensities2 <= i], start[peak.intensities2 > j & peak.intensities2 <= i], end[peak.intensities2 > j & peak.intensities2 <= i], peak.intensities2[peak.intensities2 > j & peak.intensities2 <= i]), sep = '\t', quote=FALSE, col.names=FALSE, row.names=FALSE )
j = i
}

See besides the normalized intensity, if peak coordinates assigned to each quantile is the same.

for i in test_quantile*.bed
do
wc -l $i
done
#7269 test_quantile0.2_summits.bed
#7461 test_quantile0.4_summits.bed
#7462 test_quantile0.6_summits.bed
#7461 test_quantile0.8_summits.bed
#7462 test_quantile1_summits.bed

awk '{OFS="\t"} {print $1, $2, $3}' quantile1_summits.bed > test1.bed
awk '{OFS="\t"} {print $1, $2, $3}' test_quantile1_summits.bed > test2.bed
diff test1.bed test2.bed 

No difference between the peak regions assigned to each quantile files, despite the difference in the normalized counts.

3.2 Closest GAT to peak summit

In this section, we use bedtools closestBed (refer to: https://bedtools.readthedocs.io/en/latest/content/tools/closest.html) to find the closest GAT to each provided peak summit.
Input:
Input2 -a is the sorted peak summit file (centered 1bp);
Input2 -b is the sorted, and concatenated GAT coordinates file (both plus and minus);

3.2.1 GAT coordinates file

GAT coordinates on full hg38 use read size==1000.

dir=/labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs1000/seqdump/
head ${dir}hg38.3.3.3minus.14_GAT.bed  
head ${dir}hg38.3.3.3plus.36_GAT.bed

head ${dir}hg38.3.3.3minus.36_ATC.bed  
head ${dir}hg38.3.3.3plus.14_ATC.bed

we will concatanate the plus and minus 3mer file together.

cat ${dir}hg38.3.3.3minus.14_GAT.bed ${dir}hg38.3.3.3plus.36_GAT.bed > hg38.3.3.3.30_plus_minus_GAT.bed
cat ${dir}hg38.3.3.3minus.36_ATC.bed ${dir}hg38.3.3.3plus.14_ATC.bed > hg38.3.3.3.30_plus_minus_ATC.bed

wc -l ${dir}hg38.3.3.3minus.14_GAT.bed #38717592
wc -l ${dir}hg38.3.3.3plus.36_GAT.bed #39090736
wc -l hg38.3.3.3.30_plus_minus_GAT.bed #77808328
#38717592+39090736
#[1] 77808328

wc -l ${dir}hg38.3.3.3minus.36_ATC.bed #39086237
wc -l ${dir}hg38.3.3.3plus.14_ATC.bed #38725373
wc -l hg38.3.3.3.30_plus_minus_ATC.bed #77811610

#39086237+38725373
#[1] 77811610
wc -l hg38.3.3.3.30_plus_minus_GAT.bed #77808328
head -5 hg38.3.3.3.30_plus_minus_GAT.bed
#chr1   10545   10548   14  14  -   GAT
#chr1   11145   11148   14  14  -   GAT
#chr1   11160   11163   14  14  -   GAT
#chr1   11576   11579   14  14  -   GAT
#chr1   11597   11600   14  14  -   GAT

load files contains 3mer coordinates info

#concatenate the plus and minus file together
all.GAT.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT.bed", sep="\t", header=FALSE)
head(all.GAT.file)
#    V1    V2    V3 V4 V5 V6  V7
#
tail(all.GAT.file)
#                             V1    V2    V3 V4 V5 V6  V7
#

nrow(all.GAT.file)
#[1] 

3.2.2 20% quantile peaks without motifs 12345678

load files contains selected ChIP peak summit info

chip.peak.summit.quantile1=read.table("quantile1_summits.bed", header=FALSE)
nrow(chip.peak.summit.quantile1)
head(chip.peak.summit.quantile1)

chip.peak.summit.quantile0.8=read.table("quantile0.8_summits.bed", header=FALSE)
chip.peak.summit.quantile0.6=read.table("quantile0.6_summits.bed", header=FALSE)
chip.peak.summit.quantile0.4=read.table("quantile0.4_summits.bed", header=FALSE)
chip.peak.summit.quantile0.2=read.table("quantile0.2_summits.bed", header=FALSE)
nrow(chip.peak.summit.quantile0.8)
head(chip.peak.summit.quantile0.8)
summary(chip.peak.summit.quantile0.8$V4)
nrow(chip.peak.summit.quantile0.6)
head(chip.peak.summit.quantile0.6)
summary(chip.peak.summit.quantile0.6$V4)
nrow(chip.peak.summit.quantile0.4)
head(chip.peak.summit.quantile0.4)
summary(chip.peak.summit.quantile0.4$V4)
nrow(chip.peak.summit.quantile0.2)
head(chip.peak.summit.quantile0.2)
summary(chip.peak.summit.quantile0.2$V4)

bedtools closestBed
The below function will sort input bed1 and bed2 first, then run bedtools closestBed between bed1 and bed2.

# define function 
bedTools.closest <- function(functionstring="/home/FCAM/ssun/packages/bedtools2/bin/closestBed",bed1,bed2,opt.string="") {
  
  options(scipen =99) # not use scientific notation when writing out
  
  #write bed formatted data.frames to tempfile
  write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
  
  # create the command string and call the command using system()
  # the command sort a and b file by coordinates
  command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
  cat(command1,"\n") #sort -k1,1 -k2,2n a.file.bed > a.file.sorted.bed
  try(system(command1))
  command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
  cat(command2,"\n")
  try(system(command2))
  
  # the command call closestBed on bed1 and bed2
  command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
  cat(command,"\n")
  try(system(command))
  
  res=read.table('out.file.bed',header=F, comment.char='')
  
  # remove intermediate files
  command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
  cat(command3,"\n")
  try(system(command3))
  
  colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
  return(res)
}

Parameter -d will report the distance from the closest GAT to the peak summit.
Parameter -t last or -t first will only report either the last or the first entry in bed2 file if tied distance occurred.
Here we choose to use -t first to report the first coordinates when tie occurred.

Find the closest GAT to peak summit regardless of GAT strandedness

closest.1stGAT.to.GATA.peak.quantile1=bedTools.closest(bed1 = chip.peak.summit.quantile1[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.GATA.peak.quantile0.8=bedTools.closest(bed1 = chip.peak.summit.quantile0.8[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.GATA.peak.quantile0.6=bedTools.closest(bed1 = chip.peak.summit.quantile0.6[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.GATA.peak.quantile0.4=bedTools.closest(bed1 = chip.peak.summit.quantile0.4[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.GATA.peak.quantile0.2=bedTools.closest(bed1 = chip.peak.summit.quantile0.2[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')


write.table(closest.1stGAT.to.GATA.peak.quantile1,file= 'closest.1stGAT.to.GATA.peak.quantile1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.GATA.peak.quantile0.8,file= 'closest.1stGAT.to.GATA.peak.quantile0.8.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.GATA.peak.quantile0.6,file= 'closest.1stGAT.to.GATA.peak.quantile0.6.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.GATA.peak.quantile0.4,file= 'closest.1stGAT.to.GATA.peak.quantile0.4.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.GATA.peak.quantile0.2,file= 'closest.1stGAT.to.GATA.peak.quantile0.2.bed', quote=F,sep="\t",col.names=F,row.names=F)


save.image('240108_GATA3_ChIP_clsestbed.Rdata') 

3.2.3 DHS negative controls

load files contains the 4 MCF7 DHS negative controls.

  • “full DHS control”

MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed #92388
MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed #139899

  • “independent DHS control”

MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed #69119
MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed #113774

# DHS negative ctrl (convert to 1bp summit at center)
library(bigWig)
full.DHS.control.rep1=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_Rep1_161bp_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(full.DHS.control.rep1)
head(full.DHS.control.rep1)

full.DHS.control.rep2=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_Rep2_161bp_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(full.DHS.control.rep2)
head(full.DHS.control.rep2)

indep.DHS.control.rep1=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_Rep1_161bp_noGATA_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(indep.DHS.control.rep1)
head(indep.DHS.control.rep1)

indep.DHS.control.rep2=center.bed(read.table("../ENCODE_DHS_GSE29692/MCF7DHS_Rep2_161bp_noGATA_without_motifs_123456_78.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(indep.DHS.control.rep2)
head(indep.DHS.control.rep2)
closest.1stGAT.to.full.DHS.control.rep1=bedTools.closest(bed1 = full.DHS.control.rep1[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.full.DHS.control.rep2=bedTools.closest(bed1 = full.DHS.control.rep2[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.indep.DHS.control.rep1=bedTools.closest(bed1 = indep.DHS.control.rep1[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.indep.DHS.control.rep2=bedTools.closest(bed1 = indep.DHS.control.rep2[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')


write.table(closest.1stGAT.to.full.DHS.control.rep1,file= 'closest.1stGAT.to.full.DHS.control.rep1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.full.DHS.control.rep2,file= 'closest.1stGAT.to.full.DHS.control.rep2.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.indep.DHS.control.rep1,file= 'closest.1stGAT.to.indep.DHS.control.rep1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.indep.DHS.control.rep2,file= 'closest.1stGAT.to.indep.DHS.control.rep2.bed', quote=F,sep="\t",col.names=F,row.names=F)


save.image('240108_GATA3_ChIP_clsestbed2.Rdata') 

3.2.4 positive controls

load files contains the 4 positive controls (GATA3 peak set with enriched GATA3-like motif of known structures).

GATA3_peak_161win_with_motif_1.bed
GATA3_peak_161win_with_motif_2.bed
GATA3_peak_161win_with_motif_4.bed
GATA3_peak_161win_with_motif_5.bed

# positive control
library(bigWig)
pos.control.peak.with1=center.bed(read.table("../MAST_positive_control/GATA3_peak_161win_with_motif_1.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(pos.control.peak.with1)
head(pos.control.peak.with1)

pos.control.peak.with2=center.bed(read.table("../MAST_positive_control/GATA3_peak_161win_with_motif_2.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(pos.control.peak.with2)
head(pos.control.peak.with2)

pos.control.peak.with4=center.bed(read.table("../MAST_positive_control/GATA3_peak_161win_with_motif_4.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(pos.control.peak.with4)
head(pos.control.peak.with4)

pos.control.peak.with5=center.bed(read.table("../MAST_positive_control/GATA3_peak_161win_with_motif_5.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(pos.control.peak.with5)
head(pos.control.peak.with5)

pos.control.peak.with6=center.bed(read.table("../MAST_positive_control/GATA3_peak_161win_with_motif_6.bed", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
nrow(pos.control.peak.with6)
head(pos.control.peak.with6)
closest.1stGAT.to.pos.control.motif1=bedTools.closest(bed1 = pos.control.peak.with1[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.pos.control.motif2=bedTools.closest(bed1 = pos.control.peak.with2[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.pos.control.motif4=bedTools.closest(bed1 = pos.control.peak.with4[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.pos.control.motif5=bedTools.closest(bed1 = pos.control.peak.with5[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
closest.1stGAT.to.pos.control.motif6=bedTools.closest(bed1 = pos.control.peak.with6[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')

write.table(closest.1stGAT.to.pos.control.motif1,file= 'closest.1stGAT.to.pos.control.motif1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.pos.control.motif2,file= 'closest.1stGAT.to.pos.control.motif2.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.pos.control.motif4,file= 'closest.1stGAT.to.pos.control.motif4.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.pos.control.motif5,file= 'closest.1stGAT.to.pos.control.motif5.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(closest.1stGAT.to.pos.control.motif6,file= 'closest.1stGAT.to.pos.control.motif6.bed', quote=F,sep="\t",col.names=F,row.names=F)

save.image('240108_GATA3_ChIP_clsestbed3.Rdata') 

3.2.5 CDFs

We are generating a cumulative distribution function to assess the proximity of a 3mer GAT to the summit of GATA3 ChIP peaks, potentially utilizing 3mer GAT as a binding sequence, across five quantile files. This evaluation aims to determine if the 3mer GAT sequences are closer to the summit of GATA3 ChIP peaks than they are to the summit of the MCF7 DHS regions, which consist of four files. Additionally, we possess four positive control peaks containing GATA3 motifs with known structures.

3.2.5.1 prepare plotting data frame

In previous analysis, we have used closestBed to find the closest GAT to peak summit (5 quantile peak sets, 4 MCF7 DHS negative control, and 4 positive controls).

GATA3 peaks
closest.1stGAT.to.GATA.peak.quantile0.2.bed
closest.1stGAT.to.GATA.peak.quantile0.4.bed
closest.1stGAT.to.GATA.peak.quantile0.6.bed
closest.1stGAT.to.GATA.peak.quantile0.8.bed
closest.1stGAT.to.GATA.peak.quantile1.bed

These files saved info of 1st closest GAT to GATA3 peak summits. These GATA3 peaks are peaks without MEME/MOTIF found motifs and have ranked by peak intensity.

Notice that V1 to V3 are peak summit coordinates, V4 to V9 are closestBed reported closest motif to each peak summit. The last column (V11) is the distance from closest motif to peak summit.

closest.1stGAT.to.GATA.peak.quantile1=read.table('closest.1stGAT.to.GATA.peak.quantile1.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
closest.1stGAT.to.GATA.peak.quantile0.8=read.table('closest.1stGAT.to.GATA.peak.quantile0.8.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
closest.1stGAT.to.GATA.peak.quantile0.6=read.table('closest.1stGAT.to.GATA.peak.quantile0.6.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
closest.1stGAT.to.GATA.peak.quantile0.4=read.table('closest.1stGAT.to.GATA.peak.quantile0.4.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4
closest.1stGAT.to.GATA.peak.quantile0.2=read.table('closest.1stGAT.to.GATA.peak.quantile0.2.bed',header=F, comment.char='')
head(closest.1stGAT.to.GATA.peak.quantile1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36  + GAT  21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36  + GAT   3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14  - GAT   6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36  + GAT   8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36  + GAT  32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14  - GAT   4

negative control
closest.1stGAT.to.full.DHS.control.rep1.bed
closest.1stGAT.to.indep.DHS.control.rep1.bed
closest.1stGAT.to.full.DHS.control.rep2.bed
closest.1stGAT.to.indep.DHS.control.rep2.bed

closest.1stGAT.to.full.DHS.control.rep1=read.table('closest.1stGAT.to.full.DHS.control.rep1.bed',header=F, comment.char='')
closest.1stGAT.to.full.DHS.control.rep2=read.table('closest.1stGAT.to.full.DHS.control.rep2.bed',header=F, comment.char='')


closest.1stGAT.to.indep.DHS.control.rep1=read.table('closest.1stGAT.to.indep.DHS.control.rep1.bed',header=F, comment.char='')
closest.1stGAT.to.indep.DHS.control.rep2=read.table('closest.1stGAT.to.indep.DHS.control.rep2.bed',header=F, comment.char='')


head(closest.1stGAT.to.full.DHS.control.rep1)
##     V1     V2     V3   V4     V5     V6 V7 V8 V9 V10 V11
## 1 chr1 268044 268045 chr1 268046 268049 36 36  + GAT   2
## 2 chr1 586135 586136 chr1 586125 586128 36 36  + GAT   8
## 3 chr1 629975 629976 chr1 630013 630016 14 14  - GAT  38
## 4 chr1 630235 630236 chr1 630266 630269 14 14  - GAT  31
## 5 chr1 630555 630556 chr1 630559 630562 14 14  - GAT   4
## 6 chr1 631455 631456 chr1 631414 631417 14 14  - GAT  39
nrow(closest.1stGAT.to.full.DHS.control.rep1)
## [1] 92388
nrow(closest.1stGAT.to.full.DHS.control.rep1[closest.1stGAT.to.full.DHS.control.rep1$V9=="+",])
## [1] 46316
nrow(closest.1stGAT.to.full.DHS.control.rep1[closest.1stGAT.to.full.DHS.control.rep1$V9=="-",])
## [1] 46072

positive control
closest.1stGAT.to.pos.control.motif1.bed
closest.1stGAT.to.pos.control.motif2.bed
closest.1stGAT.to.pos.control.motif4.bed
closest.1stGAT.to.pos.control.motif5.bed
closest.1stGAT.to.pos.control.motif6.bed

closest.1stGAT.to.pos.control.motif1=read.table('closest.1stGAT.to.pos.control.motif1.bed',header=F, comment.char='')
closest.1stGAT.to.pos.control.motif2=read.table('closest.1stGAT.to.pos.control.motif2.bed',header=F, comment.char='')
closest.1stGAT.to.pos.control.motif4=read.table('closest.1stGAT.to.pos.control.motif4.bed',header=F, comment.char='')
closest.1stGAT.to.pos.control.motif5=read.table('closest.1stGAT.to.pos.control.motif5.bed',header=F, comment.char='')
closest.1stGAT.to.pos.control.motif6=read.table('closest.1stGAT.to.pos.control.motif6.bed',header=F, comment.char='')

head(closest.1stGAT.to.pos.control.motif1)
##     V1      V2      V3   V4      V5      V6 V7 V8 V9 V10 V11
## 1 chr1 1080556 1080557 chr1 1080566 1080569 36 36  + GAT  10
## 2 chr1 1238237 1238238 chr1 1238225 1238228 14 14  - GAT  10
## 3 chr1 1686431 1686432 chr1 1686429 1686432 36 36  + GAT   0
## 4 chr1 1741130 1741131 chr1 1741128 1741131 36 36  + GAT   0
## 5 chr1 2002021 2002022 chr1 2002022 2002025 14 14  - GAT   1
## 6 chr1 3127377 3127378 chr1 3127380 3127383 36 36  + GAT   3
nrow(closest.1stGAT.to.pos.control.motif1)
## [1] 13273
nrow(closest.1stGAT.to.pos.control.motif1[closest.1stGAT.to.pos.control.motif1$V9=="+",])
## [1] 6700
nrow(closest.1stGAT.to.pos.control.motif1[closest.1stGAT.to.pos.control.motif1$V9=="-",])
## [1] 6573

combine data

GATA3 peak info + distance + status.

df.chip = data.frame(matrix(nrow = 0, ncol = 5))     
colnames(df.chip) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./closest.1stGAT.to.GATA.peak.quantile*.bed"))) {
    print(chip.peak)
    quantile.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'closest.1stGAT.to.GATA.peak.')[[1]][2]), ".bed")[[1]][1]
    print(quantile.name)
    all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)], quantile.name)
    df.chip = rbind(df.chip, all.distance)
}
## [1] "./closest.1stGAT.to.GATA.peak.quantile0.2.bed"
## [1] "quantile0.2"
## [1] "./closest.1stGAT.to.GATA.peak.quantile0.4.bed"
## [1] "quantile0.4"
## [1] "./closest.1stGAT.to.GATA.peak.quantile0.6.bed"
## [1] "quantile0.6"
## [1] "./closest.1stGAT.to.GATA.peak.quantile0.8.bed"
## [1] "quantile0.8"
## [1] "./closest.1stGAT.to.GATA.peak.quantile1.bed"
## [1] "quantile1"
str(df.chip)
## 'data.frame':    37115 obs. of  5 variables:
##  $ V1           : chr  "chr1" "chr1" "chr1" "chr1" ...
##  $ V2           : int  924853 966653 999508 1000536 1001891 1020187 1069549 1079599 1163013 1163246 ...
##  $ V3           : int  924854 966654 999509 1000537 1001892 1020188 1069550 1079600 1163014 1163247 ...
##  $ V11          : int  0 65 61 91 46 143 37 29 213 120 ...
##  $ quantile.name: chr  "quantile0.2" "quantile0.2" "quantile0.2" "quantile0.2" ...
unique(df.chip$quantile.name)
## [1] "quantile0.2" "quantile0.4" "quantile0.6" "quantile0.8" "quantile1"
colnames(df.chip) = c("V1","V2","V3", "dis", "status")
head(df.chip)
##     V1      V2      V3 dis      status
## 1 chr1  924853  924854   0 quantile0.2
## 2 chr1  966653  966654  65 quantile0.2
## 3 chr1  999508  999509  61 quantile0.2
## 4 chr1 1000536 1000537  91 quantile0.2
## 5 chr1 1001891 1001892  46 quantile0.2
## 6 chr1 1020187 1020188 143 quantile0.2

negative control (MCF7 DHS regions) info + distance +status.

df.neg = data.frame(matrix(nrow = 0, ncol = 5))     
colnames(df.neg) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./*DHS.control.rep*.bed"))) {
    print(chip.peak)
    rep.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'closest.1stGAT.to.')[[1]][2]), ".bed")[[1]][1]
    print(rep.name)
    all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)],  rep.name) 
    df.neg = rbind(df.neg, all.distance)
}
## [1] "./closest.1stGAT.to.full.DHS.control.rep1.bed"
## [1] "full.DHS.control.rep1"
## [1] "./closest.1stGAT.to.full.DHS.control.rep2.bed"
## [1] "full.DHS.control.rep2"
## [1] "./closest.1stGAT.to.indep.DHS.control.rep1.bed"
## [1] "indep.DHS.control.rep1"
## [1] "./closest.1stGAT.to.indep.DHS.control.rep2.bed"
## [1] "indep.DHS.control.rep2"
str(df.neg)
## 'data.frame':    415180 obs. of  5 variables:
##  $ V1      : chr  "chr1" "chr1" "chr1" "chr1" ...
##  $ V2      : int  268044 586135 629975 630235 630555 631455 631735 632315 633015 634475 ...
##  $ V3      : int  268045 586136 629976 630236 630556 631456 631736 632316 633016 634476 ...
##  $ V11     : int  2 8 38 31 4 39 44 4 11 0 ...
##  $ rep.name: chr  "full.DHS.control.rep1" "full.DHS.control.rep1" "full.DHS.control.rep1" "full.DHS.control.rep1" ...
colnames(df.neg) = c("V1","V2","V3", "dis", "status")
unique(df.neg$status)
## [1] "full.DHS.control.rep1"  "full.DHS.control.rep2"  "indep.DHS.control.rep1"
## [4] "indep.DHS.control.rep2"
head(df.neg)
##     V1     V2     V3 dis                status
## 1 chr1 268044 268045   2 full.DHS.control.rep1
## 2 chr1 586135 586136   8 full.DHS.control.rep1
## 3 chr1 629975 629976  38 full.DHS.control.rep1
## 4 chr1 630235 630236  31 full.DHS.control.rep1
## 5 chr1 630555 630556   4 full.DHS.control.rep1
## 6 chr1 631455 631456  39 full.DHS.control.rep1

positive control info + distance +status.

df.pos = data.frame(matrix(nrow = 0, ncol = 5))     
colnames(df.pos) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./closest.1stGAT.to.pos.control*.bed"))) {
    print(chip.peak)
    rep.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'closest.1stGAT.to.')[[1]][2]), ".bed")[[1]][1]
    print(rep.name)
    all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)],  rep.name) 
    df.pos = rbind(df.pos, all.distance)
}
## [1] "./closest.1stGAT.to.pos.control.motif1.bed"
## [1] "pos.control.motif1"
## [1] "./closest.1stGAT.to.pos.control.motif2.bed"
## [1] "pos.control.motif2"
## [1] "./closest.1stGAT.to.pos.control.motif4.bed"
## [1] "pos.control.motif4"
## [1] "./closest.1stGAT.to.pos.control.motif5.bed"
## [1] "pos.control.motif5"
## [1] "./closest.1stGAT.to.pos.control.motif6.bed"
## [1] "pos.control.motif6"
str(df.pos)
## 'data.frame':    50881 obs. of  5 variables:
##  $ V1      : chr  "chr1" "chr1" "chr1" "chr1" ...
##  $ V2      : int  1080556 1238237 1686431 1741130 2002021 3127377 3703557 3717481 4649719 5648934 ...
##  $ V3      : int  1080557 1238238 1686432 1741131 2002022 3127378 3703558 3717482 4649720 5648935 ...
##  $ V11     : int  10 10 0 0 1 3 1 0 2 1 ...
##  $ rep.name: chr  "pos.control.motif1" "pos.control.motif1" "pos.control.motif1" "pos.control.motif1" ...
colnames(df.pos) = c("V1","V2","V3", "dis", "status")
unique(df.pos$status)
## [1] "pos.control.motif1" "pos.control.motif2" "pos.control.motif4"
## [4] "pos.control.motif5" "pos.control.motif6"
head(df.pos)
##     V1      V2      V3 dis             status
## 1 chr1 1080556 1080557  10 pos.control.motif1
## 2 chr1 1238237 1238238  10 pos.control.motif1
## 3 chr1 1686431 1686432   0 pos.control.motif1
## 4 chr1 1741130 1741131   0 pos.control.motif1
## 5 chr1 2002021 2002022   1 pos.control.motif1
## 6 chr1 3127377 3127378   3 pos.control.motif1

merge files.

df.all=rbind(df.chip, df.neg, df.pos)
unique(df.all$status)
##  [1] "quantile0.2"            "quantile0.4"            "quantile0.6"           
##  [4] "quantile0.8"            "quantile1"              "full.DHS.control.rep1" 
##  [7] "full.DHS.control.rep2"  "indep.DHS.control.rep1" "indep.DHS.control.rep2"
## [10] "pos.control.motif1"     "pos.control.motif2"     "pos.control.motif4"    
## [13] "pos.control.motif5"     "pos.control.motif6"
df.all$status = factor(df.all$status, levels = c("quantile0.2", "quantile0.4", "quantile0.6", "quantile0.8", "quantile1", "full.DHS.control.rep1", "full.DHS.control.rep2", "indep.DHS.control.rep1", "indep.DHS.control.rep2", "pos.control.motif1", "pos.control.motif2", "pos.control.motif4", "pos.control.motif5", "pos.control.motif6"))
head(df.all)
##     V1      V2      V3 dis      status
## 1 chr1  924853  924854   0 quantile0.2
## 2 chr1  966653  966654  65 quantile0.2
## 3 chr1  999508  999509  61 quantile0.2
## 4 chr1 1000536 1000537  91 quantile0.2
## 5 chr1 1001891 1001892  46 quantile0.2
## 6 chr1 1020187 1020188 143 quantile0.2
nrow(df.all)
## [1] 503176

3.2.5.2 CDF plot

negative control comparison
First, we want to ask if the MCF7 DHS controls comparable to each other.

library(lattice)
library(latticeExtra)

df.neg$status = factor(df.neg$status, levels = c("full.DHS.control.rep1", "full.DHS.control.rep2", "indep.DHS.control.rep1", "indep.DHS.control.rep2"))

ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.neg,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c(colorRampPalette(c("blue4","blue"))(2), colorRampPalette(c("grey60","grey30"))(2)),
         aspect = 1,
         #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,2.5),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c(colorRampPalette(c("blue4","blue"))(2), colorRampPalette(c("grey60","grey30"))(2)), lwd=3), strip.background=list(col="grey85"))
         )

The blue traces are the two reps from full MCF7 DHS regions, the grey traces are the two reps from MCF7 HDS regions without overlapped GATA3 peak regions.
Four negative traces looks very similar to each other.

Since the independent DHS control is the one we removed all overlapped GATA3 ChIP peak regions, we expect that this two sets, compared to its complete version, more towards right (in other word, has less GAT 3mer close to its summit). let’s check:

rep1

library(lattice)
library(latticeExtra)

df.neg.rep1=rbind(df.neg[df.neg$status=="full.DHS.control.rep1",], df.neg[df.neg$status=="indep.DHS.control.rep1",])

df.neg.rep1$status = factor(df.neg.rep1$status, levels = c("full.DHS.control.rep1", "indep.DHS.control.rep1"))

ecdfplot(~abs(dis), groups = status, data = df.neg.rep1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("blue4","grey60"),
         aspect = 1,
         #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0, 500),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("blue4","grey60"), lwd=3), strip.background=list(col="grey85"))
         )

The two traces are mostly overlapped. By eye, grey trace (indep) is a little towards right than blue trace (full).
Further validate if these traces converge/parallel to each other.

# Specify categories to compare within df.neg.rep1
match = ecdf(abs(df.neg.rep1$dis)[df.neg.rep1$status == 'indep.DHS.control.rep1'])
rep = ecdf(abs(df.neg.rep1$dis)[df.neg.rep1$status == 'full.DHS.control.rep1'])
match.y = seq(0, 1600, by=1) # creating indices
rep.y = seq(0, 1600, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) 
## [1] 38
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'indep.DHS.control.rep1 CDF - full.DHS.control.rep1 CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'indep.DHS.control.rep1 CDF - full.DHS.control.rep1 CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

rep2

library(lattice)
library(latticeExtra)

df.neg.rep2=rbind(df.neg[df.neg$status=="full.DHS.control.rep2",], df.neg[df.neg$status=="indep.DHS.control.rep2",])

df.neg.rep2$status = factor(df.neg.rep2$status, levels = c("full.DHS.control.rep2", "indep.DHS.control.rep2"))

ecdfplot(~abs(dis), groups = status, data = df.neg.rep2,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("blue","grey30"),
         aspect = 1,
         #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0, 500),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("blue","grey30"), lwd=3), strip.background=list(col="grey85"))
         )

The two traces are mostly overlapped.

Further validate if these traces converge/parallel to each other.

# Specify categories to compare within df.neg.rep2
match = ecdf(abs(df.neg.rep2$dis)[df.neg.rep2$status == 'indep.DHS.control.rep2'])
rep = ecdf(abs(df.neg.rep2$dis)[df.neg.rep2$status == 'full.DHS.control.rep2'])
match.y = seq(0, 1600, by=1) # creating indices
rep.y = seq(0, 1600, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) 
## [1] 9
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'indep.DHS.control.rep2 CDF - full.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'indep.DHS.control.rep2 CDF - full.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

Comparing between rep1 and rep2.

# Specify categories 
match = ecdf(abs(df.neg$dis)[df.neg$status == 'full.DHS.control.rep2'])
rep = ecdf(abs(df.neg$dis)[df.neg$status == 'full.DHS.control.rep1'])
match.y = seq(0, 2300, by=1) # creating indices
rep.y = seq(0, 2300, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) 
## [1] 32
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'full.DHS.control.rep2 CDF - full.DHS.control.rep1 CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'full.DHS.control.rep2 CDF - full.DHS.control.rep1 CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

In summary, within each replicate, the independent controls predominantly overlap with the full control. Notably, the full control exhibits a slight leftward shift compared to the independent control. This aligns with expectations, considering that the independent control omits the overlapped GATA3 peak regions, resulting in reduced bias toward GAT 3mer in proximity to the summit. Additionally, the fluctuations observed in the differences between the independent trace CDF and the full trace CDF imply minimal discrepancies.

On the other hand, when comparing the negative controls between replicate 2 and replicate 1, it’s discernible that GAT is closer to the summit of replicate 1 than that of replicate 2.

all in one plot

library(lattice)
library(latticeExtra)

ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.all,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", colorRampPalette(c("grey85","grey30"))(4), colorRampPalette(c("green","darkgreen"))(5)),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,2.5),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", colorRampPalette(c("grey85","grey30"))(4), colorRampPalette(c("green","darkgreen"))(5)), lwd=3), strip.background=list(col="grey85"))
         #panel = function(...) {
         #    panel.abline(v= 1.255, lty =2) #log10(17)=1.23 #log10(17)=1.255
         #    panel.ecdfplot(...)
         #}
)

The above cumulative distribution function (PDF) plot is showing the distance between 3mer-GAT to the summit of GATA3 ChIP peak (red), or summit of the ctrl peak set (grey: MCF7 DHS regions; green: positive control).

ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.pos,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c(colorRampPalette(c("green","darkgreen"))(5)),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,2.5),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c(colorRampPalette(c("green","darkgreen"))(5)), lwd=3), strip.background=list(col="grey85"))
         #panel = function(...) {
         #    panel.abline(v= 1.255, lty =2) #log10(17)=1.23 #log10(17)=1.255
         #    panel.ecdfplot(...)
         #}
)

df1=rbind(df.neg, df.pos)
df1$status = factor(df1$status, levels = c("full.DHS.control.rep1", "full.DHS.control.rep2", "indep.DHS.control.rep1", "indep.DHS.control.rep2", "pos.control.motif1", "pos.control.motif2", "pos.control.motif4", "pos.control.motif5", "pos.control.motif6"))

ecdfplot(~log(abs(dis), base = 10), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c(colorRampPalette(c("grey85","grey30"))(4), colorRampPalette(c("green","darkgreen"))(5)),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,2.5),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c(colorRampPalette(c("grey85","grey30"))(4), colorRampPalette(c("green","darkgreen"))(5)), lwd=3), strip.background=list(col="grey85"))
         #panel = function(...) {
         #    panel.abline(v= 1.255, lty =2) #log10(17)=1.23 #log10(17)=1.255
         #    panel.ecdfplot(...)
         #}
)

A coherence check on the five positive control vs. the four negative controls. We can observe a remarkble left-shift in the cdf plot, indicates that GAT is closer to peak summit in GATA3 peaks.

3.2.5.3 empirically determine binding distance for each quantile–quantile1

We can empirically determine the distance at which the CDFs’ difference between GATA3 peak set and ctrl peak set reach the plateaus (where CDF traces became parallel or converging).

unique(df.all$status)
##  [1] quantile0.2            quantile0.4            quantile0.6           
##  [4] quantile0.8            quantile1              full.DHS.control.rep1 
##  [7] full.DHS.control.rep2  indep.DHS.control.rep1 indep.DHS.control.rep2
## [10] pos.control.motif1     pos.control.motif2     pos.control.motif4    
## [13] pos.control.motif5     pos.control.motif6    
## 14 Levels: quantile0.2 quantile0.4 quantile0.6 quantile0.8 ... pos.control.motif6

Comparing between MCF7 DHS regions (neg control) and GATA3 peak in quantile1 to determined the converging/parallel distance cutoff at ~13bp.

There are 4 negative control, we need to determine one by one:

levels(df.neg$status)
## [1] "full.DHS.control.rep1"  "full.DHS.control.rep2"  "indep.DHS.control.rep1"
## [4] "indep.DHS.control.rep2"

1) quantile1 vs. “full.DHS.control.rep1”

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control.rep1'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile1'])
#match
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "full.DHS.control.rep1"])
# x[1:1535] =      0,      1,      2,  ..., 7.1132e+05, 7.3298e+05
#rep
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "quantile1"])
# x[1:187] =      0,      1,      2,  ...,    359,    425
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) # 17; 
## [1] 17
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(full.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(full.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

By plotting the difference between the CDFs of GATA3 peaks (quantile 1) and the negative control (full.DHS.control.rep1), we can pinpoint the point where these two CDF curves converge or reach a similar trajectory.

When comparing GATA3 peaks within quantile 1 to the full.DHS.control.rep1 (MCF7 cells), we observe that at a distance of 17 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile1",], df.neg[df.neg$status=="full.DHS.control.rep1",])
df1$status = factor(df1$status, levels = c("quantile1", "full.DHS.control.rep1"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "blue4"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "blue4"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 17, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 1 compared with the negative control. At the marked red dotted line positioned at distance 17, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (17%):

frac1=sum(df.chip[df.chip$status=="quantile1",]$dis <= 17) / length(df.chip[df.chip$status=="quantile1",]$dis)
frac2=sum(df.neg[df.neg$status=="full.DHS.control.rep1",]$dis <= 17) / length(df.neg[df.neg$status=="full.DHS.control.rep1",]$dis)
frac1-frac2
## [1] 0.1723328

2) quantile1 vs. “full.DHS.control.rep2”

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control.rep2'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile1'])
#match
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "full.DHS.control.rep2"])
# x[1:2283] =      0,      1,      2,  ..., 7.33e+05, 7.3424e+05
#rep
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "quantile1"])
# x[1:187] =      0,      1,      2,  ...,    359,    425
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) # 18; 
## [1] 18
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(full.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(full.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 1 to the full.DHS.control.rep2 (MCF7 cells), we observe that at a distance of 18 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile1",], df.neg[df.neg$status=="full.DHS.control.rep2",])
df1$status = factor(df1$status, levels = c("quantile1", "full.DHS.control.rep2"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "blue"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "blue"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 18, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 1 compared with the negative control. At the marked red dotted line positioned at distance 18, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (19.8%):

frac1=sum(df.chip[df.chip$status=="quantile1",]$dis <= 18) / length(df.chip[df.chip$status=="quantile1",]$dis)
frac2=sum(df.neg[df.neg$status=="full.DHS.control.rep2",]$dis <= 18) / length(df.neg[df.neg$status=="full.DHS.control.rep2",]$dis)
frac1-frac2
## [1] 0.1985438

3) quantile1 vs. “indep.DHS.control.rep1”

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control.rep1'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile1'])
#match
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "indep.DHS.control.rep1"])
# x[1:1449] =      0,      1,      2,  ..., 7.1132e+05, 7.3298e+05
#rep
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "quantile1"])
# x[1:187] =      0,      1,      2,  ...,    359,    425
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) # 17; 
## [1] 17
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(indep.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(indep.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 1 to the indep.DHS.control.rep1 (MCF7 cells), we observe that at a distance of 17 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile1",], df.neg[df.neg$status=="indep.DHS.control.rep1",])
df1$status = factor(df1$status, levels = c("quantile1", "indep.DHS.control.rep1"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "grey60"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "grey60"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 17, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 1 compared with the negative control. At the marked red dotted line positioned at distance 17, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (17.6%):

frac1=sum(df.chip[df.chip$status=="quantile1",]$dis <= 17) / length(df.chip[df.chip$status=="quantile1",]$dis)
frac2=sum(df.neg[df.neg$status=="indep.DHS.control.rep1",]$dis <= 17) / length(df.neg[df.neg$status=="indep.DHS.control.rep1",]$dis)
frac1-frac2
## [1] 0.1759796

4) quantile1 vs. “indep.DHS.control.rep2”

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control.rep2'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile1'])
#match
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "indep.DHS.control.rep2"])
# x[1:2207] =      0,      1,      2,  ..., 7.33e+05, 7.3424e+05
#rep
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "quantile1"])
# x[1:187] =      0,      1,      2,  ...,    359,    425
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) # 18; 
## [1] 18
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(indep.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile1) CDF - negative ctrl(indep.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 1 to the indep.DHS.control.rep2 (MCF7 cells), we observe that at a distance of 18 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile1",], df.neg[df.neg$status=="indep.DHS.control.rep2",])
df1$status = factor(df1$status, levels = c("quantile1", "indep.DHS.control.rep2"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "grey30"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "grey30"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 18, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 1 compared with the negative control. At the marked red dotted line positioned at distance 17, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (20%):

frac1=sum(df.chip[df.chip$status=="quantile1",]$dis <= 17) / length(df.chip[df.chip$status=="quantile1",]$dis)
frac2=sum(df.neg[df.neg$status=="indep.DHS.control.rep2",]$dis <= 17) / length(df.neg[df.neg$status=="indep.DHS.control.rep2",]$dis)
frac1-frac2
## [1] 0.2000822

3.2.5.4 empirically determine binding distance for each quantile–quantile0.8

1) quantile0.8 vs. “full.DHS.control.rep1”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control.rep1'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.8'])
#match
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "full.DHS.control.rep1"])
#x[1:1535] =      0,      1,      2,  ..., 7.1132e+05, 7.3298e+05
#rep
#Empirical CDF 
#Call: ecdf(abs(df.all$dis)[df.all$status == "quantile0.8"])
#x[1:243] =      0,      1,      2,  ...,    337,    403
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])  
## [1] 8
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.8) CDF - negative ctrl(full.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.8) CDF - negative ctrl(full.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.8 to the full.DHS.control.rep1 (MCF7 cells), we observe that at a distance of 8 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.8",], df.neg[df.neg$status=="full.DHS.control.rep1",])
df1$status = factor(df1$status, levels = c("quantile0.8", "full.DHS.control.rep1"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("red", "blue4"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("red", "blue4"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 8, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 0.8 compared with the negative control. At the marked red dotted line positioned at distance 8, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (5.27%):

frac1=sum(df.chip[df.chip$status=="quantile0.8",]$dis <= 8) / length(df.chip[df.chip$status=="quantile0.8",]$dis)
frac2=sum(df.neg[df.neg$status=="full.DHS.control.rep1",]$dis <= 8) / length(df.neg[df.neg$status=="full.DHS.control.rep1",]$dis)
frac1-frac2
## [1] 0.05267348

2) quantile0.8 vs. “full.DHS.control.rep2”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control.rep2'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.8'])
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) 
## [1] 9
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.8) CDF - negative ctrl(full.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.8) CDF - negative ctrl(full.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.8 to the full.DHS.control.rep2 (MCF7 cells), we observe that at a distance of 9 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.8",], df.neg[df.neg$status=="full.DHS.control.rep2",])
df1$status = factor(df1$status, levels = c("quantile0.8", "full.DHS.control.rep2"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("red", "blue"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("red", "blue"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 9, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 0.8 compared with the negative control. At the marked red dotted line positioned at distance 9, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT 7%):

frac1=sum(df.chip[df.chip$status=="quantile0.8",]$dis <= 9) / length(df.chip[df.chip$status=="quantile0.8",]$dis)
frac2=sum(df.neg[df.neg$status=="full.DHS.control.rep2",]$dis <= 9) / length(df.neg[df.neg$status=="full.DHS.control.rep2",]$dis)
frac1-frac2
## [1] 0.07280715

3) quantile0.8 vs. “indep.DHS.control.rep1”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control.rep1'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.8'])
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])  
## [1] 8
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.8) CDF - negative ctrl(indep.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.8) CDF - negative ctrl(indep.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.8 to the indep.DHS.control.rep1 (MCF7 cells), we observe that at a distance of 8 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.8",], df.neg[df.neg$status=="indep.DHS.control.rep1",])
df1$status = factor(df1$status, levels = c("quantile0.8", "indep.DHS.control.rep1"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("red", "grey60"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("red", "grey60"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 8, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 0.8 compared with the negative control. At the marked red dotted line positioned at distance 8, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (5.5%):

frac1=sum(df.chip[df.chip$status=="quantile0.8",]$dis <= 8) / length(df.chip[df.chip$status=="quantile0.8",]$dis)
frac2=sum(df.neg[df.neg$status=="indep.DHS.control.rep1",]$dis <= 8) / length(df.neg[df.neg$status=="indep.DHS.control.rep1",]$dis)
frac1-frac2
## [1] 0.05558396

4) quantile0.8 vs. “indep.DHS.control.rep2”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control.rep2'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.8'])
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])  
## [1] 9
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.8) CDF - negative ctrl(indep.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.8) CDF - negative ctrl(indep.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.8 to the indep.DHS.control.rep2 (MCF7 cells), we observe that at a distance of 9 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.8",], df.neg[df.neg$status=="indep.DHS.control.rep2",])
df1$status = factor(df1$status, levels = c("quantile0.8", "indep.DHS.control.rep2"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("red", "grey30"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("red", "grey30"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 9, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 0.8 compared with the negative control. At the marked red dotted line positioned at distance 9, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (7.48%):

frac1=sum(df.chip[df.chip$status=="quantile0.8",]$dis <= 9) / length(df.chip[df.chip$status=="quantile0.8",]$dis)
frac2=sum(df.neg[df.neg$status=="indep.DHS.control.rep2",]$dis <= 9) / length(df.neg[df.neg$status=="indep.DHS.control.rep2",]$dis)
frac1-frac2
## [1] 0.07482369

3.2.5.5 empirically determine binding distance for each quantile–quantile0.6

1) quantile0.6 vs. “full.DHS.control.rep1”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control.rep1'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.6'])
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])  
## [1] 7
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.6) CDF - negative ctrl(full.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.6) CDF - negative ctrl(full.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.6 to the full.DHS.control.rep1 (MCF7 cells), we observe that at a distance of 7 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.6",], df.neg[df.neg$status=="full.DHS.control.rep1",])
df1$status = factor(df1$status, levels = c("quantile0.6", "full.DHS.control.rep1"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("red", "blue4"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("red", "blue4"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 7, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 0.6 compared with the negative control. At the marked red dotted line positioned at distance 7, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (3.5%):

frac1=sum(df.chip[df.chip$status=="quantile0.6",]$dis <= 7) / length(df.chip[df.chip$status=="quantile0.6",]$dis)
frac2=sum(df.neg[df.neg$status=="full.DHS.control.rep1",]$dis <= 7) / length(df.neg[df.neg$status=="full.DHS.control.rep1",]$dis)
frac1-frac2
## [1] 0.03499076

2) quantile0.6 vs. “full.DHS.control.rep2”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control.rep2'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.6'])
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) 
## [1] 11
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.6) CDF - negative ctrl(full.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.6) CDF - negative ctrl(full.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.6 to the full.DHS.control.rep2 (MCF7 cells), we observe that at a distance of 11 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.6",], df.neg[df.neg$status=="full.DHS.control.rep2",])
df1$status = factor(df1$status, levels = c("quantile0.6", "full.DHS.control.rep2"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "blue"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "blue"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 11, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 0.6 compared with the negative control. At the marked red dotted line positioned at distance 11, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (5.7%):

frac1=sum(df.chip[df.chip$status=="quantile0.6",]$dis <= 11) / length(df.chip[df.chip$status=="quantile0.6",]$dis)
frac2=sum(df.neg[df.neg$status=="full.DHS.control.rep2",]$dis <= 11) / length(df.neg[df.neg$status=="full.DHS.control.rep2",]$dis)
frac1-frac2
## [1] 0.05668019

3) quantile0.6 vs. “indep.DHS.control.rep1”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control.rep1'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.6'])
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])  
## [1] 8
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.6) CDF - negative ctrl(indep.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.6) CDF - negative ctrl(indep.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.6 to the indep.DHS.control.rep1 (MCF7 cells), we observe that at a distance of 8 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.6",], df.neg[df.neg$status=="indep.DHS.control.rep1",])
df1$status = factor(df1$status, levels = c("quantile0.6", "indep.DHS.control.rep1"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("red", "grey60"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("red", "grey60"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 8, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 0.6 compared with the negative control. At the marked red dotted line positioned at distance 8, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (3.8%):

frac1=sum(df.chip[df.chip$status=="quantile0.6",]$dis <= 8) / length(df.chip[df.chip$status=="quantile0.6",]$dis)
frac2=sum(df.neg[df.neg$status=="indep.DHS.control.rep1",]$dis <= 8) / length(df.neg[df.neg$status=="indep.DHS.control.rep1",]$dis)
frac1-frac2
## [1] 0.03798305

4) quantile0.6 vs. “indep.DHS.control.rep2”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control.rep2'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.6'])
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])  
## [1] 11
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.6) CDF - negative ctrl(indep.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.6) CDF - negative ctrl(indep.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.6 to the indep.DHS.control.rep2 (MCF7 cells), we observe that at a distance of 11 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.6",], df.neg[df.neg$status=="indep.DHS.control.rep2",])
df1$status = factor(df1$status, levels = c("quantile0.6", "indep.DHS.control.rep2"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("red", "grey30"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("red", "grey30"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 11, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 0.6 compared with the negative control. At the marked red dotted line positioned at distance 11, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (5.85%):

frac1=sum(df.chip[df.chip$status=="quantile0.6",]$dis <= 11) / length(df.chip[df.chip$status=="quantile0.6",]$dis)
frac2=sum(df.neg[df.neg$status=="indep.DHS.control.rep2",]$dis <= 11) / length(df.neg[df.neg$status=="indep.DHS.control.rep2",]$dis)
frac1-frac2
## [1] 0.05853615

3.2.5.6 empirically determine binding distance for each quantile–quantile0.4

1) quantile0.4 vs. “full.DHS.control.rep1”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control.rep1'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.4'])

match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])  
## [1] 4
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.4) CDF - negative ctrl(full.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.4) CDF - negative ctrl(full.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.4 to the full.DHS.control.rep1 (MCF7 cells), we observe that at a distance of 4 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.4",], df.neg[df.neg$status=="full.DHS.control.rep1",])
df1$status = factor(df1$status, levels = c("quantile0.4", "full.DHS.control.rep1"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("red", "blue4"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("red", "blue4"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 4, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, the left-shift is not as prominent as previous few cdf plots. At the marked red dotted line positioned at distance 4, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (1.4%):

frac1=sum(df.chip[df.chip$status=="quantile0.4",]$dis <= 4) / length(df.chip[df.chip$status=="quantile0.4",]$dis)
frac2=sum(df.neg[df.neg$status=="full.DHS.control.rep1",]$dis <= 4) / length(df.neg[df.neg$status=="full.DHS.control.rep1",]$dis)
frac1-frac2
## [1] 0.0137796

2) quantile0.4 vs. “full.DHS.control.rep2”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control.rep2'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.4'])
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) 
## [1] 8
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.4) CDF - negative ctrl(full.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.4) CDF - negative ctrl(full.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.4 to the full.DHS.control.rep2 (MCF7 cells), we observe that at a distance of 8 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.4",], df.neg[df.neg$status=="full.DHS.control.rep2",])
df1$status = factor(df1$status, levels = c("quantile0.4", "full.DHS.control.rep2"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "blue"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "blue"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 8, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 0.4 compared with the negative control. At the marked red dotted line positioned at distance 8, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (2.9%):

#df.chip[df.chip$status=="quantile0.4",]
#df.neg[df.neg$status=="full.DHS.control.rep1",]

frac1=sum(df.chip[df.chip$status=="quantile0.4",]$dis <= 8) / length(df.chip[df.chip$status=="quantile0.4",]$dis)
frac2=sum(df.neg[df.neg$status=="full.DHS.control.rep2",]$dis <= 8) / length(df.neg[df.neg$status=="full.DHS.control.rep2",]$dis)
frac1-frac2
## [1] 0.02930899

3) quantile0.4 vs. “indep.DHS.control.rep1”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control.rep1'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.4'])
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])  
## [1] 5
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.4) CDF - negative ctrl(indep.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.4) CDF - negative ctrl(indep.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.4 to the indep.DHS.control.rep1 (MCF7 cells), we observe that at a distance of 8 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.4",], df.neg[df.neg$status=="indep.DHS.control.rep1",])
df1$status = factor(df1$status, levels = c("quantile0.4", "indep.DHS.control.rep1"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("red", "grey60"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("red", "grey60"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 5, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, left-shift is not that remarkable. At the marked red dotted line positioned at distance 5, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (1.4%):

#df.chip[df.chip$status=="quantile0.4",]
#df.neg[df.neg$status=="full.DHS.control.rep1",]

frac1=sum(df.chip[df.chip$status=="quantile0.4",]$dis <= 5) / length(df.chip[df.chip$status=="quantile0.4",]$dis)
frac2=sum(df.neg[df.neg$status=="indep.DHS.control.rep1",]$dis <= 5) / length(df.neg[df.neg$status=="indep.DHS.control.rep1",]$dis)
frac1-frac2
## [1] 0.0139891

4) quantile0.4 vs. “indep.DHS.control.rep2”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control.rep2'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.4'])
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])  
## [1] 8
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.4) CDF - negative ctrl(indep.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.4) CDF - negative ctrl(indep.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.4 to the indep.DHS.control.rep2 (MCF7 cells), we observe that at a distance of 8 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.4",], df.neg[df.neg$status=="indep.DHS.control.rep2",])
df1$status = factor(df1$status, levels = c("quantile0.4", "indep.DHS.control.rep2"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("red", "grey30"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("red", "grey30"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 8, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 0.4 compared with the negative control. At the marked red dotted line positioned at distance 9, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (3.1%):

frac1=sum(df.chip[df.chip$status=="quantile0.4",]$dis <= 8) / length(df.chip[df.chip$status=="quantile0.4",]$dis)
frac2=sum(df.neg[df.neg$status=="indep.DHS.control.rep2",]$dis <= 8) / length(df.neg[df.neg$status=="indep.DHS.control.rep2",]$dis)
frac1-frac2
## [1] 0.03100362

3.2.5.7 empirically determine binding distance for each quantile–quantile0.2

1) quantile0.2 vs. “full.DHS.control.rep1”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control.rep1'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.2'])
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])  
## [1] 3
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.2) CDF - negative ctrl(full.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.2) CDF - negative ctrl(full.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.2 to the full.DHS.control.rep1 (MCF7 cells), we observe that at a distance of 3 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.2",], df.neg[df.neg$status=="full.DHS.control.rep1",])
df1$status = factor(df1$status, levels = c("quantile0.2", "full.DHS.control.rep1"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("red", "blue4"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("red", "blue4"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 3, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, left-shift is not prominent. At the marked red dotted line positioned at distance 3, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (0.3%):

frac1=sum(df.chip[df.chip$status=="quantile0.2",]$dis <= 3) / length(df.chip[df.chip$status=="quantile0.2",]$dis)
frac2=sum(df.neg[df.neg$status=="full.DHS.control.rep1",]$dis <= 3) / length(df.neg[df.neg$status=="full.DHS.control.rep1",]$dis)
frac1-frac2
## [1] 0.003373368

2) quantile0.2 vs. “full.DHS.control.rep2”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'full.DHS.control.rep2'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.2'])
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) 
## [1] 10
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.2) CDF - negative ctrl(full.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.2) CDF - negative ctrl(full.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.2 to the full.DHS.control.rep2 (MCF7 cells), we observe that at a distance of 10 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.2",], df.neg[df.neg$status=="full.DHS.control.rep2",])
df1$status = factor(df1$status, levels = c("quantile0.2", "full.DHS.control.rep2"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("#ce228e", "blue"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("#ce228e", "blue"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 10, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 0.2 compared with the negative control. At the marked red dotted line positioned at distance 10, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (2.3%):

frac1=sum(df.chip[df.chip$status=="quantile0.2",]$dis <= 10) / length(df.chip[df.chip$status=="quantile0.2",]$dis)
frac2=sum(df.neg[df.neg$status=="full.DHS.control.rep2",]$dis <= 10) / length(df.neg[df.neg$status=="full.DHS.control.rep2",]$dis)
frac1-frac2
## [1] 0.02349353

3) quantile0.2 vs. “indep.DHS.control.rep1”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control.rep1'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.2'])
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])  
## [1] 4
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.2) CDF - negative ctrl(indep.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.2) CDF - negative ctrl(indep.DHS.control.rep1) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.2 to the indep.DHS.control.rep1 (MCF7 cells), we observe that at a distance of 8 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.2",], df.neg[df.neg$status=="indep.DHS.control.rep1",])
df1$status = factor(df1$status, levels = c("quantile0.2", "indep.DHS.control.rep1"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("red", "grey60"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("red", "grey60"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 4, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, the left-shift is not prominent. At the marked red dotted line positioned at distance 4, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (0.6%):

frac1=sum(df.chip[df.chip$status=="quantile0.2",]$dis <= 4) / length(df.chip[df.chip$status=="quantile0.2",]$dis)
frac2=sum(df.neg[df.neg$status=="indep.DHS.control.rep1",]$dis <= 4) / length(df.neg[df.neg$status=="indep.DHS.control.rep1",]$dis)
frac1-frac2
## [1] 0.00596207

4) quantile0.2 vs. “indep.DHS.control.rep2”:

# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'indep.DHS.control.rep2'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.2'])
match.y = seq(0, 2000, by=1) # creating indices
rep.y = seq(0, 2000, by=1)

spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)

print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])  
## [1] 10
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,60),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.2) CDF - negative ctrl(indep.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
     xlim = c(0,300),
     cex=0.7,
     xlab = '3mer-GAT Distance from peak summit (bp)',
     ylab = 'GATA3 peak (quantile0.2) CDF - negative ctrl(indep.DHS.control.rep2) CDF')
     abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
      lines(pred[[1]], pred[[2]], col = 'blue')

When comparing GATA3 peaks within quantile 0.2 to the indep.DHS.control.rep2 (MCF7 cells), we observe that at a distance of 10 from the GATA3 peak summit, there are peaks exhibiting a closer resemblance to the GAT 3mer.

library(lattice)
library(latticeExtra)

df1=rbind(df.chip[df.chip$status=="quantile0.2",], df.neg[df.neg$status=="indep.DHS.control.rep2",])
df1$status = factor(df1$status, levels = c("quantile0.2", "indep.DHS.control.rep2"))


ecdfplot(~abs(dis), groups = status, data = df1,
         auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
         col=c("red", "grey30"),
         aspect = 1,
        #xlim = c(0, 50000),
         scales=list(relation="free",alternating=c(1,1,1,1)),
         ylab = 'Cumulative Distribution Function',
         xlab = expression('3mer-GAT Distance from peak summit'),
                                        #index.cond = list(c(2,1)),
         between=list(y=1.0),
         type = 'a',
         xlim = c(0,60),
         lwd=2,
         lty=c(1),
         par.settings = list(superpose.line = list(col=c("red", "grey30"), lwd=3), strip.background=list(col="grey85")),
         panel = function(...) {
             panel.abline(v= 10, lty =2, col="red")
             #panel.abline(v= 18, lty =2, col="grey30")
             panel.ecdfplot(...)
         })

In the above CDF plot, a noticeable left-shift is evident in the closest GAT distance to GATA3 peaks within quantile 0.2 compared with the negative control. At the marked red dotted line positioned at distance 10, we empirically identify the point where the two traces align in parallel.

We can determine the fraction of peaks with closest GAT (2.5%):

frac1=sum(df.chip[df.chip$status=="quantile0.2",]$dis <= 10) / length(df.chip[df.chip$status=="quantile0.2",]$dis)
frac2=sum(df.neg[df.neg$status=="indep.DHS.control.rep2",]$dis <= 10) / length(df.neg[df.neg$status=="indep.DHS.control.rep2",]$dis)
frac1-frac2
## [1] 0.02545599

3.2.6 barchart summary

Summarize the fraction of peaks (in each quantile) that has a closest GAT to its peak summits.

#module load R/4.1.2
#R
library(lattice)
bar=read.csv(file ="barchart_sum.csv", header = T)
str(bar)
bar$lib=factor(bar$lib, levels=c("quantile1","quantile0.8","quantile0.6","quantile0.4","quantile0.2",))

pdf('240112_fraction_of_peaks_have_closer_GAT.pdf', width=10,height=8)
my.settings <- list(
  #superpose.polygon=list(col=c(colorRampPalette(c("red","pink"))(6),colorRampPalette(c("blue","light blue"))(3), "light grey"), border="transparent"),
  superpose.polygon=list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e"), border="transparent"),
  strip.background=list(col="grey80", cex = 0.6),
  strip.border=list(col="black")
)
print(barchart(fraction ~ lib | negctrl,        
         data = bar,
         groups = lib,
         stack = TRUE,
         ylim=c(0,0.25),
         auto.key=list(space="right"),
         scales = list(x = list(rot = 45)),
         ylab = "fraction of peak set with closer GAT to summit",
         xlab = "peaks without motifs in quantiles",
         par.settings = my.settings)
)
dev.off()
fraction of peaks in each quantile with closer GAT compared to 4 different negative ctrls

Figure 7: fraction of peaks in each quantile with closer GAT compared to 4 different negative ctrls

library(lattice)
bar=read.csv(file ="barchart_sum.csv", header = T)
str(bar)
bar$lib=factor(bar$lib, levels=c("quantile1","quantile0.8","quantile0.6","quantile0.4","quantile0.2",))

pdf('240112_fraction_of_peaks_have_closer_GAT_2.pdf', width=8,height=6)
my.settings <- list(
  #superpose.polygon=list(col=c(colorRampPalette(c("red","pink"))(6),colorRampPalette(c("blue","light blue"))(3), "light grey"), border="transparent"),
  superpose.polygon=list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e"), border="transparent"),
  strip.background=list(col="grey80", cex = 0.6),
  strip.border=list(col="black")
)
print(barchart(fraction ~ negctrl,        
         data = bar,
         groups = lib,
         stack = TRUE,
         ylim=c(0,0.50),
         auto.key=list(space="right"),
         scales = list(x = list(rot = 45)),
         ylab = "fraction of peak set with closer GAT to summit",
         xlab = "peaks without motifs in quantiles",
         par.settings = my.settings)
)
dev.off()
fraction of peaks in each quantile with closer GAT compared to 4 different negative ctrls2

Figure 8: fraction of peaks in each quantile with closer GAT compared to 4 different negative ctrls2

3.3 find the second closest GAT

In this section, we need to consider two things:
One is to define spacing/distance as the relative position of two zinc finger;
the other is to separate the cases with different strand orientation.

4 follow-up

4.1 MACS3 called peaks:

  •   Is IgG read depth comparable to GATA3 read depth? –YES \
#cd GATA3_ChIP_PRO_July2023/ChIP_final/sorted.bam_final/read_depth/
#module load R/4.1.2
#R 
library("lattice") 
df_GATA=read.csv('GATA_sorted_bam_reads.csv')
df_IgG=read.csv('IgG_sorted_bam_reads.csv')

df_sum=rbind(df_GATA[1:3,], df_IgG)
df_sum$supp=c(rep("GATA",3), rep("IgG", 4))
df_sum$rep=c(rep(c(1,2,3),2 ),4)
df_sum$supp2=as.factor(paste(df_sum$treatment,df_sum$rep, sep = "_"))

pdf('barplot_sorted_bam_GATA_IgG_reads1.pdf',width=6,height=8)
print(barchart(aligned_reads~ supp,
         group=treatment,
         stack=T,
         data = df_sum,
         #auto.key=list(space="right"),
         scales = list(x = list(rot = 45)),
         xlab = "libraries",
         ylab = "post-alignment reads (sorted_bam)",
))
dev.off()

pdf('barplot_sorted_bam_GATA_IgG_reads.pdf',width=10,height=8)
print(barchart(aligned_reads~ supp2|supp,
         group=treatment,
         stack=T,
         data = df_sum,
         #auto.key=list(space="right"),
         scales = list(x = list(rot = 45)),
         xlab = "libraries",
         ylab = "post-alignment reads (sorted_bam)",
))
dev.off()
stacked bar plot of GATA3 libraries and IgG libraries

Figure 9: stacked bar plot of GATA3 libraries and IgG libraries

The read depth in most libraries between GATA3 and IgG is comparable. However, IgG CC_rep2 exhibits higher read depth due to its sequencing in the second pool, which contains fewer samples.

  •   Saturation curve (reads vs. number of peaks) \
  1. Randomly subset the GATA3 ChIP-seq data from 10% to 100%, with three replicates for each percentage.
  2. Randomly subset the control IgG ChIP-seq data from 10% to 100%, with four replicates for each percentage.
  3. Utilize MACS3 with the same parameters used for calling GATA3 peaks previously to detect peaks for each subsetted percentage.
  4. Count the number of peaks.
  5. Construct a saturation curve and fit it to an asymptotic regression model.
#! /bin/sh
#SBATCH --job-name=saturation_curve.sh    
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 16
#SBATCH -p general
#SBATCH --qos=general
#SBATCH --mem=128G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=ssun@uchc.edu
#SBATCH -o saturation_curve.sh_%j.out
#SBATCH -e saturation_curve.sh_%j.err
hostname
mkdir temp_macs

module load macs3
module load samtools/1.12

directory=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/sorted.bam_final/

# Define subsampling percentages
subsample_percents=$(seq 10 10 100)

for subsample_percent in $subsample_percents
do
    subsample_val=$(echo "scale=2; $subsample_percent / 100" | bc)
    
    # Sub-sample treatment BAM files
    for treatment_file in ${directory}*_GATA_CC*sorted.bam
    do
        nm=$(echo $treatment_file | awk -F"/" '{print $NF}' | awk -F"MCF7_dTAGGATA522_" '{print $2}')
        samtools view -s $subsample_val -b $treatment_file > subsampled_${subsample_val}_${nm}
    done
    
    # Sub-sample control BAM files
    for control_file in ${directory}*IgG*sorted.bam
    do
        name=$(echo $control_file | awk -F"/" '{print $NF}' | awk -F"MCF7_dTAGGATA522_" '{print $2}')
        samtools view -s $subsample_val -b $control_file > subsampled_${subsample_val}_${name}
    done
    
    # Call peaks using MACS3 with all subsampled treatment and control BAM files
    treatment_files=$(ls subsampled_${subsample_val}_GATA*.bam | tr '\n' ' ')
    control_files=$(ls subsampled_${subsample_val}_IgG*.bam | tr '\n' ' ')

    macs3 callpeak --call-summits -t $treatment_files -c $control_files -n GATA_ChIP_subsample_${subsample_val} -g hs -q 0.01 --keep-dup all -f BAMPE --nomodel --tempdir temp_macs

    # Clean up intermediate subsampled files if needed
    rm subsampled_${subsample_val}_*.bam
done
#module load R/4.1.2 
#R 
count_peaks <- function(file_path) {
  peak_count <- system(paste("wc -l", file_path), intern = TRUE)
  return(as.numeric(strsplit(peak_count, " ")[[1]][1]))
}

# Find all summit.bed files
file_paths <- Sys.glob("/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/Saturation_curve/*summits*.bed")

# Initialize an empty data frame
data <- data.frame(Read_Depth = numeric(0), Peak_Number = numeric(0))

# Read data and count peaks for each file
for (i in seq_along(file_paths)) {
  peak_number <- count_peaks(file_paths[i])
  data <- rbind(data, data.frame(Read_Depth = c(100, seq(10, 90, 10))[i], Peak_Number = peak_number))
}

# Plot the saturation curve with lattice
library(lattice)
pdf('240103_saturation_curve.pdf',width=8,height=5)
print(xyplot(Peak_Number ~ Read_Depth, data = data, type = c("p", "smooth"), col = "blue"))
dev.off()


# Fit asymptotic regression
asymptotic_model <- nls(Peak_Number ~ asymptotic_formula(Read_Depth, a, b, c), 
                        data = data, 
                        start = list(a = mean(data$Peak_Number), b = 0.1, c = 10))
# Define the asymptotic formula (replace this with the formula you want to fit)
asymptotic_formula <- function(x, a, b, c) {
  a * (1 - exp(-b * x)) + c
}

# Plot the saturation curve with lattice and add fitted curve
pdf('240103_saturation_curve2.pdf',width=8,height=5)
print(xyplot(Peak_Number ~ Read_Depth, 
             data = data, 
             type = c("p", "smooth"), 
             col = "blue",
             xlim=c(0, 300),
             ylim=c(0, 150000),
             xlab="percent read depth (%)",
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)
              panel.curve(asymptotic_formula(x, coef(asymptotic_model)["a"], coef(asymptotic_model)["b"], coef(asymptotic_model)["c"]), col = "red", add = TRUE)
                         })
)
dev.off()


png('240103_saturation_curve2.png')
print(xyplot(Peak_Number ~ Read_Depth, 
             data = data, 
             type = c("p", "smooth"), 
             col = "blue",
             xlim=c(0, 300),
             ylim=c(0, 150000),
             xlab="percent read depth (%)",
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)
              panel.curve(asymptotic_formula(x, coef(asymptotic_model)["a"], coef(asymptotic_model)["b"], coef(asymptotic_model)["c"]), col = "red", add = TRUE)
                         })
)
dev.off()
saturation curve for GATA3 ChIP-seq peaks

Figure 10: saturation curve for GATA3 ChIP-seq peaks

  •   Peak sensitivity and specificity using GATA3-depleted cells and untreated cells. Do sensitive peak sets have a sequence-binding element?  \

4.2 GATA3 structure

  •   Homology of two GATA3 zinc fingers -N-zinc sequence, C-zinc sequences, align them. \

make sure the full structure of zinc finger; and the specificity of each zinc finger binding to the specific sequences.
Sequence Retrieval from Uniprot P23771 · GATA3_HUMAN
zinc finger 1: 263-287: CVNCGATSTPLWRRDGTGHYLCNAC[GLY] ; zinc finger 2: 317-341: CANCQTTTTTLWRRNANGDPVCNAC[GLY]
NCBI BLAST Alignment:

NCBI blast for the two zinc finger

Figure 11: NCBI blast for the two zinc finger

The two zinc finger of GATA3 have 60% identity.

  •   What are the characteristics of the two activation domains? \
  •   What are the characteristics of GATA3 zinc fingers? \

4.3 Software to search for variable spacing motifs.

4.4 Functional check:

  •   Given purified zinc fingers and a DNA template, can we observe different spacing/orientations of binding patterns? (cryoEM or in silicon alphafold to find protein-DNA binding) \